asimplenonparametricapproachtoestimatingthedistribution...

A Simple Nonparametric Approach to Estimating the Distributionof Random Coefficients in Structural Models

Jeremy T. FoxRice University & NBER

Kyoo il Kim∗

Michigan State UniversityChenyu Yang

University of Michigan

July 2015

Abstract

We explore least squares and likelihood nonparametric mixtures estimators of the joint distributionof random coefficients in structural models. The estimators fix a grid of heterogenous parametersand estimate only the weights on the grid points, an approach that is computationally attractivecompared to alternative nonparametric estimators. We provide conditions under which the esti-mated distribution function converges to the true distribution in the weak topology on the spaceof distributions. We verify most of the consistency conditions for three discrete choice models.We also derive the convergence rates of the least squares nonparametric mixtures estimator underadditional restrictions. We perform a Monte Carlo study on a dynamic programming model.

Keywords: Random coefficients, mixtures, discrete choices, dynamic programming, sieve estimation

∗Corresponding authors: Jeremy Fox at [email protected] and Kyoo il Kim at [email protected].

1

1 Introduction

Economic researchers often work with models where the parameters are heterogeneous across thepopulation. A classic example is that consumers may have heterogeneous preferences over a set ofproduct characteristics in an industry with differentiated products. These heterogeneous parametersare often known as random coefficients. When working with cross sectional data, the goal is oftento estimate the distribution of heterogeneous parameters. This distribution captures the essentialheterogeneity that is key to explaining the economic phenomena under study.

This paper studies estimating the distribution of heterogeneous parameters F (β) in the model

Pj (x) =

∫gj (x, β) dF (β) , (1)

where j is the index of the jth out of J finite values of the outcome y, x is a vector of observedexplanatory variables, β is the vector of heterogeneous parameters, and gj (x, β) is the probability thatthe jth outcome occurs for an observation with heterogeneous parameters β and explanatory variablesx. Given this structure, Pj (x) is the cross sectional probability of observing the jth outcome whenthe explanatory variables are x. The researcher picks gj (x, β) as the underlying model, has an i.i.d.sample of N observations (yi, xi), and wishes to estimate F (β). As F is only restricted to be a validCDF, the mixture model (1) is nonparametric.

Previous estimators for F (β), whether parametric or nonparametric, include the EM algorithm,Markov Chain Monte Carlo (MCMC), simulated maximum likelihood, simulated method of moments,and minimum distance. As typically implemented, these estimators suffer from a key flaw: they arecomputationally challenging. The researcher must code some iterative search or simulation procedure.Often, convergence may not be to the preferred, global solution. Convergence may be hard to ensureand to detect.

The unknown distribution F (β) enters (1) linearly. Thus, we can exploit linearity and achieve acomputationally simpler estimator than some alternatives. Bajari, Fox and Ryan (2007) and Fox, Kim,Ryan and Bajari (2011), henceforth FKRB, propose dividing the support of the vector β into a finiteand known grid of vectors β1, . . . , βR. Let yi,j equal 1 when the observed outcome for observation i isj and 0 otherwise. In FKRB, the researcher then estimates the weights θ1, . . . , θR on the R grid pointsas the linear probability model regression of yi,j on the R predicted probabilities gj (x, βr). We alsoimpose the constraints that each θr ≥ 0 and that

∑Rr=1 θ

r = 1. Thus, the estimator of the distributionF (β) with N observations and R grid points becomes

FN (β) =∑R

r=1θr1 [βr ≤ β] ,

where θr’s denote estimated weights and 1 [βr ≤ β] is equal to 1 when βr ≤ β. The computationaladvantages of the procedure are immediate. Computationally, the estimator is linear regression (leastsquares) subject to linear inequality and equality constraints. This optimization problem is globallyconvex and specialized routines (such as MATLAB’s command lsqlin) are guaranteed to converge tothe global optimum. One can impose that the estimate yields a continuous density by making theapproximation a finite mixture of say normal densities rather than a finite mixture of mass points.

2

FKRB highlight the practical usefulness of their least squares estimator by showing how it can be usedin a series of examples of heterogeneous parameter models employed in consumer choice and othersettings, including applications with endogenous explanatory variables handled with instruments andapplications with aggregate data.

Citing Bajari, Fox and Ryan (2007), Train (2008, Section 6) similarly suggests fixing a grid ofvectors β1, . . . , βR and estimating only the weights θ1, . . . , θR on the R grid points. The difference isthat Train suggests using a likelihood criterion and, computationally, maximizing the likelihood usingthe EM algorithm. Computationally, the EM algorithm for a model with a fixed grid β1, . . . , βR avoidsnumerical optimization routines altogether. Alternatively, it is simple to implement a gradient searchmethod, as we discuss below. The resulting likelihood is globally concave.

Both the likelihood and linear probability model estimators have computational savings for complexstructural models where economic models must be solved as part of the estimation procedure. Ina dynamic programming application such as adding heterogeneous parameters to Rust (1987), thedynamic programming model must be solved only R times, once for each heterogeneous parameter βr.These solutions occur before optimization commences and are not nested inside an iterative search orsimulation procedure. This contrasts with competing approaches, where multiple dynamic programsmust be solved for every change in the estimation algorithm’s guess of F (β). In this respect, fixed gridestimators share some computational advantages with the parametric approach in Ackerberg (2009).

A serious limitation is that the analyses in FKRB and Train assume that the R grid points used in afinite sample are indeed the true grid points that take on nonnegative support in the true F0 (β). Thus,the true distribution F0 (β) is assumed to be known up to a finite number of weights θ1, . . . , θR. Thisassumption is convenient as the estimator is consistent under standard conditions for the consistencyof least squares and maximum likelihood methods under inequality and equality constraints (Andrews2002). As economists often lack convincing economic rationales to pick one set of grid points overanother, assuming that the researcher knows the true distribution up to finite weights is unrealistic.

This paper seeks to place the appealing, computationally simple fixed grid estimators on firmertheoretical ground. Instead of assuming that the distribution is known up to weights θ1, . . . , θR, werequire the true distribution F0 (β) to satisfy much weaker restrictions. In particular, the true F0 (β)

can have any of continuous, discrete and mixed continuous and discrete supports. The prior approachesin FKRB and Train are parametric as the true weights θ1, . . . , θR lie in a finite-dimensional subset ofa real space. Here, the approach is nonparametric as the true F0 (β) is known to lie only in theinfinite-dimensional space of multivariate distributions on the space of heterogeneous parameters β.

In a finite sample of N observations, our estimators are still implemented by choosing a fixed grid ofpoints θ1, . . . , θR , ideally to trade off bias and variance in the estimate FN (β). We, however, recognizethat as the sample increases, R and thus the fineness of the grid of points should also increase in orderto reduce the bias in the approximation of F (β). We write R (N) to emphasize that the number ofgrid points (and implicitly the grid of points itself) is now a function of the sample size. The maintheorem in our paper is that, under restrictions on the economic model and an appropriate choice ofR (N), our least squares and likelihood estimators FN (β) converge to the true F0 (β) as N → ∞, ina function space. We use the Lévy-Prokhorov metric, a common metrization of the weak topology onthe space of multivariate distributions.

We recognize that the nonparametric versions of our estimators are special cases of sieve estimators

3

(Chen 2007). Sieve estimators estimate functions by increasing the flexibility of the approximatingclass used for estimation as the sample size increases. A sieve estimator for a smooth function mightuse an approximating class defined by a Fourier series, for example. As we are motivated by practicalconsiderations in empirical work, our estimators’ choice of basis, a finite grid of points, is justifiedby the estimators’ computational simplicity. Further and unlike a typical sieve estimator, we need toconstrain our estimated functions to be valid distribution functions. Our constrained least squares andlikelihood approaches are both computationally simple and ensure that the estimated CDFs satisfy thetheoretical properties of a valid CDF.

Because our estimators are sieve estimators, we prove their consistency by satisfying high-levelconditions for the consistency of sieve extremum estimators, as given in an appendix lemma in Chenand Pouzo (2012). We repeat this lemma and its proof in our paper so our consistency proof is self-contained. Our fixed grid estimators are not a special case of the two-step sieve estimators exploredusing lower-level conditions in the main text of Chen and Pouzo. Note that under the Lévy-Prokhorovmetric on the space of multivariate distributions, the problem of optimizing the population objectivefunction over the space of distributions turns out to be well posed under the definition of Chen (2007).Thus, our method does not rely on a sieve space to regularize the estimation problem to address theill-posed inverse problem, as much of the sieve literature focuses on.

Our fixed grid approach is a general, nonparametric mixtures estimator for a model with a finitenumber of outcome values. The most common frequentist, nonparametric estimator is nonparametricmaximum likelihood or NPMLE (Laird 1978, Böhning 1982, Lindsay 1983, Heckman and Singer 1984).While a fixed grid likelihood might be viewed as a computational approximation to the NPMLE, inthis paper we view our fixed grid likelihood estimator as a separate estimator. The key difference fromour fixed grid approach then becomes that NPMLE optimizes over both the grid of heterogeneousparameters and the weights. For NPMLE, often the EM algorithm is used for computation (Dempster,Laird and Rubin 1977), but this approach is not guaranteed to find the global maximum unless, ofcourse, the likelihood is globally concave, as in our fixed grid likelihood estimator’s case. The literatureworries about the strong dependence of the output of the EM algorithm on initial starting values as wellas the difficulty in diagnosing convergence (Seidel, Mosler and Alker 2000, Verbeek, Vlassis and Kröse2002, Biernacki, Celeux and Govaert 2003, Karlis and Xekalaki 2003).1 Further, the EM algorithmhas a slow rate of convergence even when it does converge to a global solution (Pilla and Lindsay2001, Varadhan and Roland 2008). Li and Barron (2000) introduce another alternative, but again ourfixed grid approach is computationally simpler. Our estimator is also computationally simpler than theminimum distance estimator of Beran and Millar (1994), which in our experience often has an objectivefunction with an intractably large number of local minima. The fixed grid idea (called the “histogram”approach) is found outside of economics in Kamakura (1991), who uses a fixed grid to estimate anideal-point model. He does not discuss the nonparametric statistical properties of his approach. Ofcourse, mixtures themselves have a long history in economics, such as Quandt and Ramsey (1978).

We prove the consistency of our estimators for the distribution of heterogeneous parameters, infunction space under the weak topology. We present separate theorems for mixtures of discrete gridpoints and mixtures of continuous densities with a grid of points over the parameters of each density.

1Another drawback of NPMLE that is specific to mixtures of normal distributions, a common approximating choice,is that the likelihood is unbounded and hence maximizing the likelihood does not produce a consistent estimator. Thereis a consistent root but it is not the global maximum of the likelihood function (McLachlan and Peel 2000).

4

The theorem for the mixture of grid points requires the heterogeneous parameters to lie in a, not nec-essarily known, compact set. The theorem for a mixture of continuous densities allows for unboundedsupport of the heterogeneous parameters. Our consistency theorems are not specific to the economicmodel being estimated.

Despite the presence of nonparametric estimators in the literature, they are not commonly used byapplied practitioners estimating distributions of heterogeneous parameters. By proving the nonpara-metric consistency of a computationally simple estimator for general economic choice models, we hopethat nonparametric methods will be increasingly adopted by practitioners in industrial organization,marketing and other applied fields.

We provide the rate of convergences for a subset of the models handled by our consistency theorem,namely those that are differentiable in the heterogeneous parameters. The convergence rates, theasymptotic estimation error bounds, consist of two terms: the bias and the variance. While obtainingthe variance term is rather standard in the sieve estimation literature, deriving the bias term dependson the specific approximation methods (e.g., power series or splines). Because our use of approximatingfunctions is new in the sieve estimation literature, deriving the bias term is not trivial. We provide thebias term, which is the smallest possible approximation error of the true function using sieves for theclass of models we consider.

Our rate of convergence results highlight an important practical issue with any nonparametric es-timator: there is a curse of dimensionality in the dimension of the heterogeneous parameters. Largersample sizes will be needed if the vector of heterogeneous parameters has more elements. Further,the rate results indicate that our baseline estimator is not practical when there is a large number ofheterogeneous parameters. In high dimensional settings, we suggest allowing heterogeneous parameterson only a subset of explanatory variables and estimating homogenous parameters on the remainingexplanatory variables. We extend our consistency result to models where some parameters are ho-mogeneous. However, including homogeneous parameters requires nonlinear optimization, which losessome of the computational advantages of our estimators.

Another limitation of our estimators is that we have to determine a high dimensional tuning param-eter, the grid of heterogeneous parameters, before we proceed with estimation. FKRB develop practicalideas for determining the grid, such as a location-scale model and a cross-validation approach. However,these methods require additional computational time.

We finally provide a Monte Carlo study. We estimate a dynamic programming, discrete choicemodel, adding heterogeneous parameters to the framework of Rust (1987). The dynamic programmingproblem must be solved once for each realization of the heterogeneous parameters. We present resultsfor both the fixed grid likelihood and least squares estimators as well as, for comparison, a likelihoodestimator where we estimate both the grid of points and the weights on those points. We show thatour fixed grid estimators have superior speed but inferior statistical accuracy compared to the moreusual approach of estimating a flexible grid.

The outline of our paper is as follows. Section 2 presents three examples of discrete choice mixturemodels. Section 3 introduces the estimation procedures. Section 4 demonstrates consistency of ourestimators in the space of multivariate distributions. Section 5.1 extends our consistency results tomodels with both heterogeneous parameters and homogeneous parameters and Section 5.2 considersmixtures of smooth basis densities. Section 6 verifies most of the primitive conditions for consistency

5

established in Section 4 using the three examples of mixture models in Section 2. Section 7 derives theconvergence rates of the nonparametric estimator for a class of models. Finally, Section 8 presents theMonte Carlo study.

2 Examples of Mixture Models

In our framework, the object the econometrician wishes to estimate is F (β), the distribution of thevector of heterogeneous parameters β. One definition of identification is that a unique F (β) solves(1) for all x and all outcomes j = 1, . . . , J . This is the definition used in certain relevant papers onidentification in the statistics literature, for example Teicher (1963).

We will return to these three examples of discrete choice models later in the paper. Each exampleconsiders economic models with heterogeneous parameters that play a large role in empirical work.Some of the example models are nested in others, but verification of the conditions for consistency inSection 6 will use additional restrictions on the supports of x and β that are non-nested across models.

Example 1. (logit) Let there be a multinomial choice model such that y is one of J unordered choices,such as types of cars for sale. For j ≥ 2, the utility of choice j to consumer i is ui,j = x′i,jβi+εi,j , wherexi,j is a vector of observable product characteristics of choice j and the demographics of consumer i, βiis a vector of random coefficients giving the marginal utility of each car’s characteristics to consumeri, and εi,j is an additive, consumer- and choice-specific error. There is an outside good 1 with utilityui,1 = εi,1. The consumer picks choice j when ui,j > ui,h ∀h 6= j. The random coefficients logit modeloccurs when εi,j is known to have the type I extreme value distribution. In this example, (1) becomes,for j ≥ 2,

Pj (x) =

∫gj (x, β) dF (β) =

∫ exp(x′jβ

)1 +

∑Jh=2 exp

(x′hβ

)dF (β) ,

where x = (x2, . . . , xJ). A similar expression occurs for other choices h 6= j. Compared to priorempirical work using the random coefficients logit, our goal is to estimate F (β) nonparametrically.

Example 2. (binary choice) Let J = 2 in the previous example, so that there is one inside goodand one outside good. Thus, the utility of the inside good 2 is ui,2 = εi + x′iβ2,i, where βi = (εi, β2,i) isseen as one long vector and εi supplants the logit errors in Example 1 and plays the role of a randomintercept. The outside good 1 has utility ui,1 = 0. In this example, (1) becomes, for j = 2,

P2 (x) =

∫g2 (x, β) dF (β) =

∫1[ε+ x′β2 ≥ 0

]dF (β) ,

where 1 [·] is the indicator function equal to 1 if the inequality in the brackets is true. Without logiterrors, the joint distribution of both the intercept and the slope coefficients is estimated nonparamet-rically. In this example, g2 (x, β) is discontinuous in β.

6

Example 3. (multinomial choice without logit errors) Consider a multinomial choice modelwhere the distribution of the previously logit errors is also estimated nonparametrically. In this case,the utility to choice j ≥ 2 is ui,j = x′i,j βi + εi,j and the utility of the outside good 1 is ui,1 = 0. The

notation βi is used because the full heterogeneous parameter vector is now βi =(βi, εi,2, . . . , εi,J

),

which is seen as one long vector. We will not assume that the additive errors εi,j are distributedindependently of βi or of each other. In this example, (1) becomes, for j ≥ 2,

Pj (x) =

∫gj (x, β) dF (β) =

∫1[x′j β + εj ≥ max

{0, x′hβ + εh

}∀h 6= j, h ≥ 2

]dF (β) .

3 Estimator

We analyze both least squares (linear probability models) and maximum likelihood criteria. We firstdiscuss the least squares criterion, from FKRB. Recall that yi,j is equal to 1 whenever the outcome yifor the ith observation is j, and 0 otherwise. Start with the model (1) and add yi,j to both sides whilemoving Pj (x) to the right side. For the statistical observation i, this gives

yi,j =

∫gj (xi, β) dF (β) + (yi,j − Pj (xi)) . (2)

By the definition of Pj (x), the expectation of the composite error term yi,j − Pj (x), conditional onx, is 0. This is a linear probability model with an infinite-dimensional parameter, the distributionF (β). We could work directly with this equation if it was computationally simple to estimate thisinfinite-dimensional parameter while constraining it to be a valid CDF.

Instead, we work with a finite-dimensional sieve space approximation to F . In particular, we letR (N) be the number of grid points in the grid BR(N) =

(β1, . . . , βR(N)

). A grid point is a vector if

β is a vector, so R (N) is the total number of points in all dimensions. The researcher chooses BR(N).Given the choice of BR(N), the researcher estimates θ =

(θ1, . . . , θR(N)

), the weights on each of the

grid points. With this approximation, (2) becomes

yi,j ≈R(N)∑r=1

θrgj (xi, βr) + (yi,j − Pj (x)) . (3)

We use the ≈ symbol to emphasize that (3) uses a sieve approximation to the distribution functionF (β). Because each θr enters yi,j linearly, we estimate

(θ1, . . . , θR(N)

)using the linear probability

model regression of yi,j on the R “regressors” zri,j = gj (xi, βr).

To be a valid CDF, θr ≥ 0 ∀ r and∑R(N)

r=1 θr = 1. Therefore, the estimator is

θ = arg minθ

1

NJ

∑N

i=1

∑J

j=1

yi,j − R(N)∑r=1

θrzri,j

2

(4)

subject to θr ≥ 0∀ r = 1, . . . , R (N) andR(N)∑r=1

θr = 1.

7

There are J “regression observations” for each statistical observation (yi, xi). This minimization prob-lem is a quadratic programming problem subject to linear inequality constraints. The minimizationproblem is convex and routines like MATLAB’s lsqlin guarantee finding a global optimum. One canconstruct the estimated cumulative distribution function for the heterogeneous parameters as

FN (β) =∑R(N)

r=1θr1 [βr ≤ β] .

Thus, we have a structural estimator for a distribution of heterogeneous parameters in addition to aflexible method for approximating choice probabilities.

Following Train, we can also use the log-likelihood criterion (divided by the sample size)

L =N∑i=1

log

R(N)∑r=1

θrzri,yi

/N,

where the zri,yi are the probabilities computed above for the observed outcome yi for observation i. As

with the least squares criterion, we enforce the constraints θr ≥ 0∀ r = 1, . . . , R (N) and∑R(N)

r=1 θr = 1.Computationally, one can use the EM algorithm in Train’s Section 6 or a nonlinear, gradient-basedsearch routine, which is available in most scientific packages. The performance of the gradient-basedsearch routine will be improved if the gradient of the likelihood is provided to the solver in closed form.The sth element of that gradient is

∂L

∂θs=

N∑i=1

zsi,yi∑R(N)r=1 θrzri,yi

/N.

The log likelihood is globally concave and any local maximum found will be the global maximum.The fixed grid approach, whether based on a least squares or a likelihood criterion, has two main

advantages over other approaches to estimating distributions of heterogeneous parameters. First,the approach is computationally simple: we can always find a global optimum and, by solving forzri,j = gj (xi, β

r) before optimization commences, we avoid many evaluations of complex structuralmodels such as dynamic programming problems. Second, the approach is nonparametric. In the nextsection, we show that if the grid of points is made finer as the sample size N increases, the estimatorsFN (β) converge to the true distribution F0. We do not need to impose that F0 lies in known parametricfamily.

On the other hand, a disadvantage is that the estimates may be sensitive to the choice of tuningparameters. While most nonparametric approaches require choices of tuning parameters, here thechoice of a grid of points is a particularly high-dimensional tuning parameter. FKRB propose cross-validation methods to pick these tuning parameters, including the number of grid points, the supportof the points, and the grid points within the support.

Example. 1 (logit) For the logit example,exp(x′i,jβr)

1+∑Jh=2 exp(x′i,hβr)

= gj (xi, βr). To implement the least

squares estimator, for each statistical observation i, the researcher computes R · J probabilities zri,j =exp(x′i,jβr)

1+∑Jh=2 exp(x′i,hβr)

. This computation is done before optimization commences. The outcome for choos-

8

ing the outside good 1 does not need to be included in the objective function, as∑J

j=1 gj (xi, βr) = 1.

To implement the likelihood estimator, the researcher computes R probabilities zri,yi for each statisticalobservation i.

Remark 1. FKRB discuss the case of panel data on T periods. Let yTi = (yi,1, . . . , yi,T ) be theactual sequence of T outcomes for panel observation i. Similarly, let the explanatory variables be(xi,1, . . . , xi,T ). The likelihood criterion for panel data is

L =N∑i=1

log

R(N)∑r=1

θrzri,yTi

/N,

where zri,yTi

=∏Tt=1 gyi,t (xi,t, β

r). We do not explore panel data in our theoretical results.

4 Consistency in Function Space

Assume that the true distribution function F0 lies in the space F of distribution functions on thesupport B of the heterogeneous parameters β. We wish to show that the estimated distributionfunction FN (β) =

∑R(N)r=1 θr1 [βr ≤ β] converges to the true F0 ∈ F as the sample size N grows large.

To prove consistency, we use the results for sieve estimators developed by Chen and Pouzo (2012),hereafter referred to as CP. We define a sieve space to approximate F as

FR =

{F | F (β) =

∑R

r=1θr1 [βr ≤ β] , θ ∈ ∆R ≡

{(θ1, . . . , θR

)′ | θr ≥ 0,∑R

r=1θr = 1

}},

for a choice of grid BR =(β1, . . . , βR

)that becomes finer as R increases. We require FR ⊆ FS ⊂ F for

S > R, or that large sieve spaces encompass smaller sieve spaces. The choice of the grids and R (N)

are up to the researcher; however consistency will require conditions on these choices.Based on CP, we prove that the estimator FN converges to the true F0. In their main text, CP study

sieve minimum distance estimators that involve a two-stage procedure. Our estimator is a one-stagesieve least squares estimator (Chen, 2007) and so we cannot proceed by verifying the conditions in thetheorems in the main text of CP. Instead, we show its consistency based on CP’s general consistencytheorem in their appendix, their Lemma A.2, which we quote in the proof of our consistency theoremfor completeness. As a consequence, our consistency proof verifies CP’s high-level conditions for theconsistency of a sieve extremum estimator.

First, we consider the least squares criterion and then the likelihood criterion follows. Let yi denotethe J×1 finite vector of binary outcomes (yi,1, . . . , yi,J) and let g(xi, β) denote the corresponding J×1

vector of choice probabilities (g1 (xi, β) , . . . , gJ (xi, β)) given xi and the heterogeneous parameter β.Then we can define our sample criterion function for least squares as

QN (F ) ≡ 1

NJ

∑N

i=1

∥∥∥∥yi − ∫ g(xi, β)dF (β)

∥∥∥∥2

E

=1

NJ

∑N

i=1

∥∥∥∥yi −∑R

r=1θrg(xi, β

r)

∥∥∥∥2

E

(5)

9

for F ∈ FR(N), where || · ||E denotes the Euclidean norm. We can rewrite our estimator as

FN = argminF∈FR(N)QN (F ) + C · νN (6)

where we can allow for some tolerance (slackness) of minimization, C · νN , that is a positive sequencetending to zero as N gets larger, if necessary.

Also let

Q(F ) ≡ E

[∥∥∥∥y − ∫ g (x, β) dF (β)

∥∥∥∥2

E

/J

]be the population objective function.

As a distance measure for distributions, we use the Lévy-Prokhorov metric, denoted by dLP(·),which is a metrization of the weak topology for the space of multivariate distributions F . The Lévy-Prokhorov metric in the space of F is defined on a metric space (B, d) with its Borel sigma algebraΣ(B). We use the notation dLP(F1, F2), where the measures are implicit. This denotes the Lévy-Prokhorov metric dLP(µ1, µ2), where µ1 and µ2 are probability measures corresponding to F1 and F2.The Lévy-Prokhorov metric is defined as

dLP (µ1, µ2) = inf {ε > 0 | µ1 (C) ≤ µ2 (Cε) + ε andµ2 (C) ≤ µ1 (Cε) + ε for all Borel measurableC ∈ Σ(B)} ,

where C ⊆ B and Cε = {b ∈ B | ∃ a ∈ C, d (a, b) < ε}. The Lévy-Prokhorov metric is a metric, so thatdLP (µ1, µ2) = 0 only when µ1 = µ2. See Huber (1981, 2004).

The following assumptions are on the economic model and data generating process. We writeP (x, F ) =

∫g(x, β)dF (β).

Assumption 1.

1. Let F be a space of distribution functions on a finite-dimensional real space B. F is compact inthe weak topology and contains the true F0.

2. Let ((yi, xi))Ni=1 be i.i.d.

3. Let β be independently distributed from x.

4. Assume the model g (x, β) is identified, meaning that for any F1 6= F0, F1 ∈ F , the set X ⊆ Xwhere P (x, F0) 6= P (x, F1) has a positive measure in X .2

5. Q (F ) is continuous on F in the dLP (·, ·) metric.

Assumption 1.1, the compactness of F is satisfied if B itself is compact in Euclidean space (Parthasarathy1967, Theorem 6.4). Unfortunately, the compactness of B rules out some examples such as normal dis-tributions of heterogeneous parameters. In part to address this, Section 5.2 provides a consistencytheorem for a related estimator, which can use mixtures of normal distributions, where the supportof the heterogeneous parameters is allowed to be RK . Assumptions 1.2 and 1.3 are standard fornonparametric mixtures models with cross-sectional data.

2This is with respect to the probability measure of the underlying probability space. This probability is well definedwhether x is continuous, discrete or some elements of x are functions of other elements (e.g. polynomials or interactions).

10

Assumption 1.4 requires that the model be identified at a set of values of x that occurs with positiveprobability. The assumption rules out so-called fragile identification that could occur at values of xwith measure zero (such as identification at infinity). Assumptions 1.4 and 1.5 need to be verified foreach economic model g (x, β). We will discuss these assumptions for our three examples in Section 6.

Remark 2. Assumption 1.5 is satisfied when g(x, β) is continuous in β for all x because in this caseP (x, F ) is also continuous on F for all x in the Lévy-Prokhorov metric. Then by the dominated conver-gence theorem, the continuity of Q (F ) in the dLP(·, ·) metric follows from the continuity of P (x, F ) onF for all x and P (x, F ) ≤ 1 (uniformly bounded). Here the continuity of P (x, F ) on F means for anyF1 ∈ F and such that dLP(F1, F2) → 0 it must follow that

∣∣∫ gj(x, β)dF1(β)−∫gj(x, β)dF2(β)

∣∣ → 0

for all j. This holds by the definition of weak convergence when g(x, β) is continuous and boundedand because the Lévy-Prokhorov metric is a metrization of the weak topology.

Remark 3. If the support B is a finite set, the continuity of Q (F ) holds even when gj(x, β) is dis-continuous, because in this case the Lévy-Prokhorov metric becomes equivalent to the total variationmetric (see Huber 1981, p.34). This implies∣∣∣∣∫ gj(x, β)dF1(β)−

∫gj(x, β)dF2(β)

∣∣∣∣→ 0

for any F1, F2 ∈ F such that dLP(F1, F2) → 0, in part because gj(x, β) is bounded between 0 and 1.We do not have a counterexample to continuity when B is not a finite set. Note that our consistencyresult for a mixtures of parametric densities, presented in Section 5.2, has a continuity condition thatis easier to verify for discontinuous gj(x, β), as Section 6 discusses.

In addition to Assumption 1, we also require that the grid of points be chosen so that the grid BRbecomes dense in B in the usual topology on the reals.

Condition 1. Let the choice of grids satisfy the following properties:

1. Let BR become dense in B as R→∞.

2. FR ⊆ FR+1 ⊆ F for all R ≥ 1.

3. R(N)→∞ as N →∞ and it satisfies R(N) logR(N)N → 0 as N →∞.

The first two parts of this condition have previously been mentioned and ensure that the sievespaces give increasingly better approximations to the space of multivariate distributions. Condition1.3 specifies a rate condition so that the convergence of the sample criterion function QN (F ) to thepopulation criterion function Q(F ) is uniform over FR. Uniform convergence of the criterion functionand identification are both key conditions for consistency.

Theorem 1. Suppose Assumption 1 and Condition 1 hold. Then, dLP

(FN , F0

)p→ 0.

See Appendix B.1 for the proof of the forthcoming Theorem 2, which nests Theorem 1.

Remark 4. Appendix A proves that this estimation problem is well posed under the definition of Chen(2007).

11

Next, we consider the distribution function estimator using the log-likelihood criterion. Define thepopulation criterion function and its sample analog, respectively, as

Q(F ) ≡ −QML(F ) ≡ −E

J∑j=1

yi,j logPj(xi, F )

= −E

J∑j=1

yi,j log

∫gj(xi, β)dF (β)

and

QN (F ) ≡ −QMLN (F ) ≡ − 1

N

∑N

i=1

J∑j=1

yi,j logPj(xi, F ) = − 1

N

∑N

i=1

J∑j=1

yi,j log

∫gj(xi, β)dF (β)

for F ∈ FR(N). Then we can obtain the ML estimator as in (6) and denote the resulting estimator byFMLN . We obtain the following corollary for the consistency of this ML estimator

Corollary 1. Suppose Assumption 1 and Condition 1 hold. Further suppose that Pj (x, F ) is boundedaway from zero for all j and F ∈ F . Then, dLP

(FMLN , F0

)p→ 0.

See Appendix B.2 for the proof.

Remark 5. The literature on sieve estimation has not established general results on the asymptoticdistribution of sieve estimators, in function space. However, for rich classes of approximating basisfunctions that do not include our approximation problem, the literature has shown conditions underwhich finite dimensional functionals of sieve estimators have asymptotically normal distributions. Inthe case of nonparametric heterogeneous parameters, we might be interested in inference in the meanor median of β. For demand estimation, say Example 3, we might be interested in average responses(or elasticities) of choice probabilities with respect to changes in particular product characteristics.Let ΠNF0 be a sieve approximation to F0 in our sieve space FR(N). If we could obtain an error boundfor dLP (ΠNF0, F0), we could also derive the convergence rate in the Lévy-Prokhorov metric (Chen2007). If the error bound shrinks fast enough as R(N) increases, we conjecture that we could alsoprove that plug-in estimators for functionals of F0 are asymptotically normal (Chen, Linton, and vanKeilegom 2003).3 Error bounds for discrete approximations are available in the literature for a class ofparametric distributions F , but we are not aware of results for the unrestricted class of multivariatedistributions. For a subset of problems, including the random coefficients logit, we are able to deriveapproximation error bounds. We provide convergence rates for these cases in Section 7.

5 Extensions

5.1 Models with Homogenous Parameters

In many empirical applications, it is common to have both heterogeneous parameters β and finite-dimensional parameters γ ∈ Γ ⊆ Rdim(γ). We write the model choice probabilities as g(x, β, γ) and theaggregate choice probabilities as P (x, F, γ). Here we consider the consistency of estimators for models

3We conjecture that we could prove an analog to Theorem 2 in Chen et al (2003) if we could verify analogs toconditions (2.4)–(2.6) in that paper for our sieve space.

12

with both homogenous parameters and heterogeneous parameters. For conciseness, we state a theoremfor the least squares criterion and omit a corollary for the likelihood criterion.

Remark 6. Estimating a model allowing a parameter to be a heterogeneous parameter when in truththe parameter is homogeneous will not affect consistency if the model with heterogeneous parametersis identified.

Remark 7. Searching over γ as a homogeneous parameter for the least squares criterion requires nonlin-ear least squares. The optimization problem may also not be globally convex. The objective functionmay not be differentiable for our examples where g(x, β, γ) involves an indicator function.

Our estimator for models with homogeneous parameters is defined as (similarly to (6))

(γN , FN ) = argmin(γ,F )∈Γ×FR(N)QN (γ, F ) + C · νN ,

where QN (γ, F ) denotes the corresponding sample criterion function. Q(γ, F ) is the population cri-terion function based on the model g(x, β, γ). An alternative computational strategy is profiling, asin

FN (γ) = argminF∈FR(N)QN (γ, F ) + C · νN for all γ ∈ Γ.

Profiling gives usγN = argminγ∈ΓQN

(γ, FN (γ)

)+ C · νN ,

and therefore FN = FN (γN ). We make the following assumptions to prove consistency for models withhomogenous parameters.

Assumption 2.

1. Let A ≡ Γ × F where F is compact in the weak topology and Γ is a compact subset of Rdim(γ)

and A contains the true (γ0, F0).


3. Let β be independently distributed from x.

4. Assume the model g (x, β, γ) is identified, meaning that for any (γ1, F1) 6= (γ0, F0), (γ1, F1) ∈Γ×F , the set X ⊆ X where P (x, F0, γ0) 6= P (x, F1, γ1) has a positive measure in X .

5. At least one of the following properties holds.

(a) gj (x, β, γ) is Lipschitz continuous in γ for each outcome j.

(b) (i)(γN , FN

)is well-defined and measurable. (ii) For each outcome j, there exists a vec-

tor of known functions h (x, β, γ) = (h1 (x, β, γ) , . . . , hJ (x, β, γ)) such that, for each j,gj (x, β, γ) = 1 [Aj · h (x, β, γ) > 0] for some vector of known constants Aj. (iii) Each of theJ functions hj (x, β, γ) is Lipschitz continuous in γ.

6. Q (γ, F ) is lower semicontinuous in γ, is continuous on F in the weak topology, and is continuousat (γ0, F0).

13

If homogeneous parameters were added in the examples we consider, Assumption 2.5.a would holdfor Example 1, the logit model with random coefficients. Assumption 2.5.b might hold for Examples 2and 3. See the remark below.

We present the consistency theorem for the estimator with homogeneous parameters.

Theorem 2. Suppose Assumption 2 and Condition 1 hold. Then, γNp→ γ0 and FN

p→ F0.

See Appendix B.1 for the proof. Again, we omit a corollary for the likelihood case.

Remark 8. Assumption 2.5.b is designed to handle extensions of Examples 2 and 3 to include homoge-neous parameters. In the extensions of these examples, the homogeneous parameters enter inside indi-cator functions. The non-primitive portion, Assumption 2.5.b.i, requires that the estimator

(γN , FN

)be well-defined and measurable, echoing a condition in Lemma A.2 of Chen and Pouzo (2012), whichour consistency proof relies on. Remark A.1.i.a in CP states that lower semicontinuity of the leastsquares objective function is a sufficient condition for the estimator to be well-defined and measurable.Even though an indicator function for an open set is lower semicontinuous, the least squares objective(or likelihood) function itself might not be lower semicontinuous in γ even if each gj (x, β, γ) is lowersemicontinuous in γ. For example, multiplication by a negative number is enough to change lowersemicontinuity into upper semicontinuity. Therefore, we follow the main text of CP and, for the caseof indicator functions, maintain the non-primitive assumption that the estimator is well-defined andmeasurable. Remark A.1.i.b in CP does not apply to gj (x, β, γ) being indicator functions either. Wehave not explored the three sufficient conditions in Remark A.1.ii in CP, although they also involvevariations on the concept of lower semicontinuity.

5.2 Continuous Distribution Function Estimator

A limitation of the discrete approximation estimator is that the CDF of the heterogeneous parameterswill be a step function. In applied work, it is often attractive to have a smooth distribution of heteroge-neous parameters. In this subsection, we describe one approach to obtain a continuous distribution ordensity function estimator that allows for unbounded supports. Instead of modeling the distribution ofthe heterogeneous parameters as a mixture of point masses, we instead model the density as a mixtureof parametric densities, e.g. normal densities. Approximating a density or distribution function usinga mixture of parametric densities or distributions is popular (e.g. Jacobs, Jordan, Nowlan, and Hinton1991, Li and Barron 2000, McLachlan and Peel 2000, and Geweke and Keane 2007). Our estimator’sadvantage is its computational simplicity.

As a leading case, let a basis r be a normal distribution with mean the K×1 vector µr and standarddeviation the K × 1 vector σr. Let φ (β | µr, σr) denote the joint normal density corresponding to therth basis distribution. Under independent normal basis functions, the joint density for a given r isjust the product of the marginals, or φ (β | µr, σr) =

∏Kk=1 φ (βk | µrk, σrk) . We can also use only a

location mixture with the basis functions φ (β | µr) =∏Kk=1 φ (βk | µrk) or use a multivariate normal

mixture with the basis functions φ (β | µr,Σr), where Σr denotes a variance-covariance matrix. We canalso consider mixtures of other parametric density functions. We use the generic notation φ (β | λr) todenote the rth basis function, where λr is the rth distribution parameter. Let θr denote the probabilityweight given to the rth basis function, φ (β | λr). As in the discrete approximation estimator, the vectorof weights θ lies in the unit simplex, ∆R.

14

There are many ways to perform K-dimensional numerical integration, such as sparse grid quadra-ture (Heiss and Winschel 2008). We focus on simulation for simplicity. To implement our continuousdensity estimator for a given R, make S simulation draws from φ (β | λr) independently of r (i.e. useindependent simulation draws for each λr). Let a particular draw s be denoted as βr,s. We then createthe R “regressors”

zri,j =

∫gj (xi, β)φ (β | λr) dβ ≈ 1

S

S∑s=1

gj (xi, βr,s) .

The ≈ emphasizes the error (possibly quite small) in numerical integration. This numerical integrationstep is done first, before any optimization. We then approximate Pj (xi) as

Pj (xi) ≈R∑r=1

θrzri,j .

Here, we use the ≈ to emphasize both sieve and numerical integration approximations, althoughtypically the sieve approximation will be a larger source of error. We estimate θ using the inequalityconstrained least squares problem as before

θS = arg minθ∈∆R

1

NJ

∑N

i=1

∑J

j=1

(yi,j −

∑R

r=1θrzri,j

)2

. (7)

This is once again inequality-constrained linear least squares, a globally convex optimization problemwith an easily-computed unique solution. The resulting density estimator is fN,S (β) =

∑Rr=1 θ

rSφ (β | λr)

and the distribution function estimator is FN,S (β) =∑R

r=1 θrSΦ (β | λr), where dΦ(·) = φ(·).

5.2.1 Consistency of the Continuous Distribution Function Estimator

We show consistency for the continuous distribution function estimator after imposing additional re-strictions on the data generating process. For this purpose, we restrict the class of the true distributionfunctions to

FM =

{F : F =

∫Φ(β | λ)Pλ(dλ), Pλ ∈ Pλ, dΦ(β | λ) ∈ G =

{dΦ(β | λ) | β ∈ B ⊆ RK , λ ∈ Λ ⊂ Rd

}},

(8)such that any distribution in FM is in truth given by a mixture of parametric distributions in G.Note that we allow for B = RK in FM , thus removing the restriction that B is compact underlyingTheorem 1. Here Pλ denotes a probability measure on Λ, the support of the distribution parameter λ.This means that we assume the true distribution is in the space of possibly continuous mixtures oversome known parametric basis functions. Note, however, that Petersen (1983, as Lemma 3.4 of Zeeviand Meir (1997)) suggests that any density function can be approximated to arbitrary accuracy by aninfinitely countable convex combination of basis densities, including normals, i.e. G can be dense inthe space of continuous density functions. For example Zeevi and Meir (1997) show that G is dense inthe space of all density functions that are bounded away from zero on compact support. Therefore weargue that the class FM can be arbitrary close to the space of any continuous distribution functionswith suitable choices of G. We approximate this possibly continuous mixture using a finite mixture

15

over the same basis functions. Accordingly we construct our sieve space as

FMR =

{F | F =

∑R

r=1θrΦ(β | λr), dΦ(β | λ) ∈ G, θ ∈ ∆R

}.

First consider an estimator ignoring numerical integration error in the regressors

FN (β) =∑R(N)

r=1θrΦ(β|λr), (9)

where θ = arg minθ∈∆R(N)

1NJ

∑Ni=1

∑Jj=1

(yi,j −

∑R(N)r=1 θrzri,j

)2and zri,j =

∫gj(xi, β)dΦ(β | λr) with

a choice of grid ΛR =(λ1, . . . , λR

)on Λ. We consider simulation error later.

Let F0 =∫

Φ(β | λ)Pλ,0(dλ) ∈ FM . Then we can present the consistency result in terms ofestimating the true Pλ,0 because knowing Pλ fully characterizes the true F0 given a researcher’s choiceof Φ(β | λ) (e.g. normal distributions). Note that (9) can be also written in terms of estimating Pλwith the estimator

Pλ,N (λ) =∑R(N)

r=1θr1 [λr ≤ λ]

where the sieve space for Pλ is given by

Pλ,R =

{Pλ | Pλ (λ) =

∑R

r=1θr1 [λr ≤ λ] , θ ∈ ∆R

}.

Therefore FN =∫

Φ(·|λ)dPλ,N (dλ)p→ F0 =

∫Φ(·|λ)dPλ,0(dλ) as long as Pλ,N

p→ Pλ,0 because weassume F0 ∈ FM and because Φ(β|λ) is continuous in λ for all β and therefore F =

∫Φ(·|λ)dPλ(dλ)

is also continuous on Pλ in the Lévy-Prokhorov metric. This facilitates our analysis because we candirectly apply Theorem 1 to Pλ,N under assumptions below. To present the consistency theorem weneed additional notation.

Define a sample criterion function in terms of estimating Pλ,0 as

QN (Pλ) ≡ 1

NJ

∑N

i=1

∥∥∥∥yi − ∫ g(xi, λ)dPλ (λ)

∥∥∥∥2

E

=1

NJ

∑N

i=1

∥∥∥∥yi −∑R

r=1θrg(xi, λ

r)

∥∥∥∥2

E

(10)

for Pλ ∈ Pλ,R(N), where g(x, λr) =∫g(x, β)dΦ(β|λr). Then we can rewrite the estimator Pλ,N as

Pλ,N = argminPλ∈Pλ,R(N)QN (Pλ) + C · νn (11)

for some positive sequence νn tending to zero. Also define the population criterion function as

Q(Pλ) ≡ E

[∥∥∥∥y − ∫ g (x, λ) dPλ (λ)

∥∥∥∥2

E

/J

].

We make the following assumptions, which are similar to those used in Theorem 1

Assumption 3.

1. Let FM be a space of distribution functions generated by Pλ in (8), where Λ is compact. FM

16

contains the true F0.


3. Let both β and λ be independently distributed from x.

4. Assume the model g(x, β) is identified, meaning that for any Pλ,1 6= Pλ,0, the set X ⊆ X whereP (x, F0) 6= P (x, F1) such that F0 =

∫Φ(· | λ)Pλ,0(dλ) and F1 =

∫Φ(· | λ)Pλ,1(dλ) has a positive

measure in X .

5. Q (Pλ) is continuous on Pλ in the weak topology.

We leave out lengthy discussion of these assumptions because most of these assumptions are madefor similar reasons as those in Assumption 1. More discussion is in the beginning of Section 6. We alsorequire that the choice of grids on Λ satisfies the following properties.

Condition 2. 1. Let ΛR =(λ1, . . . , λR

)become dense in Λ as R→∞.

2. Pλ,R ⊆ Pλ,R+1 ⊆ Pλ for all R ≥ 1.

3. R(N)→∞ as N →∞ and it satisfies R(N) logR(N)N → 0 as N →∞.

Theorem 3. Suppose Assumption 3 and Condition 2 hold. Then, dLP

(Pλ,N , Pλ,0

)p→ 0 and dLP

(FN , F0

)p→

0.

A brief proof is in an appendix. However, most of the steps mirror the proofs of Theorems 1 and2, and so the appendix omits the unchanged steps for conciseness.

Next we account for the simulation error in the basis functions from approximating the integralwith respect to Φ(β | λr). Denote the resulting distribution estimator with simulated basis functionsby FN,S (β) ≡

∑R(N)r=1 θrSΦ(β | λr), where

θS = arg minθ∈∆R(N)

1

N

∑N

i=1

∑J

j=1

yi,j − R(N)∑r=1

θr1

S

S∑s=1

g(xi, βr,s)

2

. (12)

Note that all simulation draws are independent and that different draws are used for each r. Weobtain the mixing ratios θS as in (12) using the simulated basis functions and our distribution functionestimator is still FN,S (β), which belongs to FMR . Note that we only use simulation to approximateΦ(β | λr) and to obtain θS in (12). Our distribution estimator is still FN,S (β), not FN,S (β) ≡∑R(N)

r=1 θrS1S

∑Ss=1 1 [βr,s ≤ β], because the CDF Φ(β | λr) is known to researchers. In the normal

distribution case, specialized software calculates the normal CDF Φ(β | λr) with much less error thantypical simulation approaches. We show our estimator is consistent when the number of simulationdraws tends to infinity.

Theorem 4. Suppose Assumption 3 and Condition 2 hold. Let S →∞. Then, dLP

(FN,S , F0

)p→ 0.

The proof is in the appendix.

17

6 Discussion of Examples

We return to the examples we introduced in Section 2. We discuss the two key conditions for each modelg (x, β): Assumption 1.4, identification of F (β), and Assumption 1.5, continuity of the populationobjective function under the Lévy-Prokhorov metric. Throughout this section, we assume Assumptions1.1–1.3 hold. Note that Matzkin (2007) is an excellent survey of older results on the identification ofmodels with heterogeneity.

For the mixture of continuous densities, the identification of the mixture distribution, Assumption3.4, will typically occur if the underlying distribution of heterogeneous parameters is identified, As-sumption 1.4, and the basis functions are chosen appropriately. We do not need to explicitly verifyAssumption 3.4 once Assumption 1.4 is verified. We also do not explicitly verify Assumption 3.5,continuity of the population objective function in the continuous mixtures cases. An important pointis that even when g(x, β) is nonsmooth as in Examples 2 and 3, g(x, λ) =

∫g(x, β)dΦ(β | λ) can

still be continuous and differentiable in λ because it is smoothed by integration with respect to thedistribution Φ(β | λ). Therefore, Assumption 3.5 can be satisfied by Remark 2.

Example. 1 (logit) The identification of F (β) in the random coefficients logit model is the main con-tent of Fox, Kim, Ryan, and Bajari (2012, Theorem 15).4 Assumption 14 in Fox et al states that “Thesupport of x, X contains x = 0, but not necessarily an open set surrounding it. Further, the supportcontains a nonempty open set of points (open in Rdim(xj)) of the form

(x′2, . . . , x

′j−1, x

′j , x′j+1, . . . , x

′J

)=(

0′, . . . , 0′, x′j , 0′, . . . , 0′

).”5 Fox et al also require the support of X to be a product space, which rules

out including polynomial terms in an element of xj or including interactions of two elements of xj .Given this assumption, Assumption 1.4 holds. Assumption 1.5 holds by Remark 2 in the current paper.

Example. 2 (binary choice) Ichimura and Thompson (1998, Theorem 1) establish the identificationof F (β) under the conditions that i) the coefficient on one of the the non-intercept explanatory variablesin x is known to either always be positive or either always be negative (the sign can be identified), ii)there is some normalization such as the coefficient known to be positive or negative is always either+1 or −1 (more generally the random coefficients lie on an unknown hemisphere), iii) there are largeand product supports on each of the explanatory variables other than the intercept. This rules outpolynomial terms and interactions. If we impose the scale normalization βk∗ = ±1, Assumption 1.4holds if we add large and product support conditions on each explanatory variable in x. An advantageof our estimator is the ease of imposing sign restrictions if necessary. Because the researcher picksBR(N) =

(β1, . . . , βR(N)

), the researcher can choose the grid so that the first element of each vector

βr is always positive, for example. Note that binary choice is a special case of multinomial choice, sothe non-nested identification conditions in example 3, below, can replace these used here. Next, notethat the continuity condition (Assumption 1.5) holds by Remark 3 when the support B is a finite set.We have not shown that continuity holds under only the identification assumptions of Ichimura andThompson (1998, Theorem 1), although we know of no counterexamples.

4Theorem 15 of Fox et al also allows homogeneous, product-specific intercepts.5Fox et al discusses what x = 0 means when the means of product characteristics can be shifted.

18

Example. 3 (multinomial choice without logit errors) Fox and Gandhi (2015, Theorem 2) studythe identification of the multinomial choice model without logit errors. Our linear specification of theutility function for each choice is a special case of what they allow. Fox and Gandhi require a choice-j-specific explanatory variable with large and product support. On other hand, Fox and Gandhi allowpolynomial terms and interactions for x’s other than the choice-j-specific large support explanatoryvariables, unlike examples 1 and 2. They do not require large support for the x’s that are not thelarge support, choice-specific explanatory variables. The most important additional assumption foridentification is that Fox and Gandhi require that F (β) takes on at most a finite number T of supportpoints, although the number T and support point identities β1, . . . , βT are learned in identification.The number T in the true F0 is not related in any way to the finite-sample R (N) used for estimationin this paper. So Assumption 1.4 holds under this restriction on F . Next, the continuity Assumption1.5 holds also by Remark 3 when the support B is a finite set.

7 Estimation Error Bounds

We next derive convergence rates for the approximation of the underlying distribution of heterogeneousparameters F (β) for a subset of the models allowed in the consistency theorems in Sections 3 and5.2. These results highlight, among other issues, the curse of dimensionality in the dimension of theheterogeneous parameters. Larger sample sizes are needed if the dimension K of the heterogeneousparameters is larger. We present results for the least squares criterion and do not include corollariesfor the likelihood criterion.

7.1 Discrete Approximation Estimator

Our results apply to a narrower set of true models gj(x, β) than the earlier consistency theorems. Inparticular, we require that gj(x, β) be smooth, so we allow our Example 1, the logit case, but not theother examples, where gj(x, β) involves an indicator function. Nevertheless, the random coefficientslogit is a leading model used in empirical work and the results exploiting the smoothness of gj(x, β)

are illustrative of issues, such as the curse of dimensionality, that apply also to nonsmooth choices forgj(x, β).

Our results relate to the literature on sieve estimations (e.g., Newey (1997) and Chen (2007)).Of course, the major difference is our choice of discrete basis functions, motivated by our estimator’scomputational advantages and our desire to easily constrain our estimate to be a valid CDF. We cannotdirectly apply previous results because of our different sieve space. Our proof technique to derive ourerror bounds uses results on quadrature, as we will explain.

We restrict the true distribution F0 to lie in a class of distributions that have smooth densities on acompact space of heterogeneous parameters. We define the parameter space for the choice probabilitiesP (x, F ) as the collection of those generated by such smooth densities:

H =

{P (x, F ) | max

0≤s≤ssupβ∈B|Dsf(β)| ≤ C,B =

∏K

k=1

[βk, βk

], f = dF ∈ Cs [B]

}, (13)

19

where Ds = ∂s

∂βα11 ...∂β

αKK

, s = α1 + · · · + αK with D0f = f , giving the collection of all derivatives of

order s. Also, Cs[B] is a space of s-times continuously differentiable density functions defined on B.Therefore we assume any element of the class of density functions that generates H is defined on aCartesian product B, is uniformly bounded by C <∞, is s-times continuously differentiable, and hasall own and partial derivatives uniformly bounded by C. The definition of H depends on C and s; thedegree of smoothness s will show up in our convergence rate results.

Let the space of approximating functions be

HR(N) ={P (x, F ) | F ∈ FR(N)

}. (14)

We require that the grid points accumulate in FR(N) such that HR ⊆ HR+1 ⊆ . . .. Our approach usesresults from quadrature to pick approximation choices θr such that we can approximate the choiceprobability P (x, F0) arbitrarily well using approximating functions in HR(N) as R (N)→∞.

In this section and in the corresponding proofs, we let ‖v‖E =√v′v, ‖h‖2L2,N

= 1N

∑Ni=1 ||h(xi)||2E ,

‖h‖2L2=∫||h||2Ed$ (the norm in L2), and ‖h‖2∞ = supx∈X ||h(x)||2E for any function h : X → R,

where $ denotes a probability measure on X . We introduce the linear probability model error ei,j ,as in yi,j = Pj(xi, F ) + ei,j and E [ei,j | X1, . . . , XN ] = 0. In addition to restricting the class of truedistributions, we make the following additional assumptions.

Assumption 4. (i)(ei = (ei,1, . . . , ei,J)′

)Ni=1

are independently distributed; (ii) E [ei | X1, . . . , XN ] =

0; (iii) (Xi)Ni=1 are i.i.d. with a density function bounded above; (iv) g(x, β) is s-times continuously

differentiable w.r.t. β and its (all own and partial) derivatives are uniformly bounded ; (v) the sievespace defined in ( 14) satisfies HR ⊆ HR+1 ⊂ . . . .

Assumptions 4 (i)-(iii) are about the structure of the data and in particular they allow for het-eroskedasticity for the linear probability error, which is necessary for linear probability models. Aspreviewed earlier, Assumption 4 (iv) assumes that the true model g(x, β) is differentiable. We useAssumption 4 (iv) to approximate P (x, F ) using a sieve method for F combined with a quadraturemethod for the choice of weights θr. Assumption 4 (iv) is satisfied by Example 1. Assumption 4 (v)was mentioned previously.

Any asymptotic error bound consists of two terms: the order of bias and the variance. Whileobtaining the variance term is rather standard in the sieve estimation literature, deriving the biasterm depends on the specific choice of basis function (e.g., power series or splines in the previousliterature). Because our choice of basis functions is new in the sieve literature, we first state the orderof bias, meaning the approximation error rate of our sieve approximation to arbitrary conditional choiceprobabilities P (x, F ) in H. Keep in mind this bias result is primarily about the flexibility of a class ofapproximations and has less to do with using a finite sample of data.

Lemma 1. Suppose P (x, F ) ∈ H and suppose Assumptions 4 (iv) and (v) hold. Then there exist F ∗ ∈FR(N) such that for all x ∈ X , ‖P (x, F ∗)− P (x, F )‖2E = O

(R−2s/K

). Further suppose Assumptions

4 (i)-(iii) hold. Then, ‖P (xi, F∗)− P (xi, F )‖2L2,N

= OP(R−2s/K

).

To prove this result, we combine a quadrature approximation with a power series approximationto approximate P (x, F ) =

∫g (x, β) f(β)dβ. We first approximate g (x, β) f(β) using a tensor product

20

power series in β. Then we approximate the integral of the tensor products approximation with respectto β using quadrature. The complete proof is in Appendix C.3.

Lemma 1 is the key ingredient that allows us to use machinery from the sieve literature for our esti-mator. Let C be a (generic) positive constant. Define ΨR ≡

(E[∑

j gj (Xi, βr) gj

(Xi, β

r′)])

1≤r,r′≤R(a R×R matrix) and its smallest eigenvalue as ξmin(R). Then we obtain the following estimation errorbounds.

Theorem 5. Suppose Assumptions 1.1, 1.3, and 1.4 and Condition 1 hold. Suppose Assumption 4holds. Suppose further that R(N)2 logR(N)

N → 0 and ξmin(R) > 0 for all finite R. Then if P (x, F0) ∈ Hand C is a particular constant,∥∥∥P (xi, FN)− P (xi, F0)

∥∥∥2

L2,N

≤ C ·max

{R(N) logR(N)

ξmin(R(N)) ·N,R(N)−2s/K

}w.p.a.1.

Theorem 5 establishes a bound on the distance between the true conditional choice probability andthe approximated conditional choice probability. It shows how the finite-sample estimator is able tofit data generated by a nonparametric choice of a heterogeneity distribution. Roughly speaking, in theestimation error bound the first term in the max operator corresponds to an asymptotic variance andthe second term corresponds to an asymptotic bias (Chen 2007). Fixing N , a larger R(N) reduces thebias but increases the variance and an optimal choice of R(N) is obtained by balancing the bias andvariance. Obtaining the variance term is standard in the sieve estimation literature and obtaining thebias term requires approximation results that depend on the type of sieves (in the previous literature,e.g., power series or splines). In the proof, we obtain this bias term using Lemma 1.

There are several lessons from this error bound. The approximation error rate in the bias termshows that we have faster convergence with a smoother density function s and slower convergence witha higher dimensional β, K. These results are intuitive. In many other settings, smoother distributionsare easier to approximate and higher dimensional distributions are harder to approximate. Practically,one might want to use caution when estimating an unrestricted joint distribution of β when K is large.

Note that the condition ξmin(R) > 0 for all finite R in Theorem 5 means the regressors gj (Xi, βr)

are not linearly dependent across different βr in least squares estimation. Theorem 5 allows the caselimR→∞ ξmin(R) = 0 but this term drops out in the convergence rate if limR→∞ ξmin(R) > 0. The caselimR→∞ ξmin(R) = 0 means that ΨR, which is the probability limit (as N → ∞ with fixed R) of thesample matrix

(1N

∑Ni=1

∑Jj=1 gj(xi, β

r)gj(xi, βr′))

1≤r,r′≤R, tends to be singular as R grows to infinity.

In this case, as often proposed in the literature (e.g. Hastie, Tibshirani, and Friedman 2009), we canpotentially reduce the variances of resulting estimators using a subset selection method or a shrinkagemethod, such as LASSO and ridge/Tikhonov regression. On the other hand, these approaches cancause additional bias in finite samples.

The previous theorem was for choice probabilities, but we are also interested in the distributionfunction itself. Using the result in Theorem 5, we can approximate the distribution of heterogeneousparameters with the following convergence rate.

Theorem 6. Suppose the assumptions and conditions in Theorem 5 hold. Then for a particular

21

constant C, w.p.a.1

∣∣∣FN (β)− F0 (β)∣∣∣ ≤ C√ R(N)

ξmin(R(N))max

{√R(N) logR(N)

ξmin(R(N)) ·N,R(N)−s/K

}a.e. β ∈ B.

Not surprisingly, the bias term in this bound suggests that the estimator of the distribution functionsuffers from a curse of dimensionality in the number of heterogeneous parameters K and has a fasterrate of convergence if the true generating process is smoother, as indexed by s.

7.2 Continuous Distribution Function with Compact Support

Here we consider the asymptotics for the smooth density estimator proposed in Section 5.2. In this ap-proach, we approximate the vector of choice probabilities P (xi) = (P1 (xi) , . . . , PJ (xi)) using smoothbasis functions as

P (xi) ≈R∑r=1

θr∫g(xi, β)dΦ(β | λr),

where Φ(β | λr) denotes the rth basis distribution function. We focus on simulation as the numericalintegration technique. Therefore,

P (xi) ≈R∑r=1

θr∫g(xi, β)dΦ(β | λr) ≈

R∑r=1

θr

(1

S

S∑s=1

g(xi, βr,s)

), (15)

where the (βr,s)Ss=1 are drawn from Φ(β | λr). We then can rewrite (15) as

P (xi) ≈R∑r=1

θr

(1

S

S∑s=1

g(xi, βr,s)

)=

R∑r=1

S∑s=1

θr

Sg(xi, β

r,s) =R×S∑r=1

θrg(xi, βr)

for some β r and θr. Therefore, we can interpret (15) as the discrete approximation of P (xi) with R ·Sgrid points and R · (S − 1) restrictions such that the first group of S coefficients on the first group ofS regressors are identical to θ1

S , the second group of S coefficients on the second group of S regressorsare identical to θ2

S , and so on.Therefore, the smooth mixture model can be nested in the discrete approximation case if the support

of β is bounded. To see this note that we have FN,S (β) ≡∑R(N)

r=1 θrSΦ(β | λr) and the simulation-approximated distribution estimator by FN,S (β) ≡

∑R(N)r=1 θrS

1S

∑Ss=1 1 [βr,s ≤ β] where θS solves (12).

Then we can obtain∣∣∣FN,S (β)− F0(β)∣∣∣ ≤ ∣∣∣FN,S (β)− FN,S(β)

∣∣∣+∣∣∣FN,S (β)− F0(β)

∣∣∣≤

R(N)∑r=1

θrS

∣∣∣∣∣ 1SS∑s=1

1 [βr,s ≤ β]− Φ(β | λr)

∣∣∣∣∣+∣∣∣FN,S (β)− F0(β)

∣∣∣ p→ 0 (16)

as N →∞ and S →∞. The first term in (16) goes to zero because for any given r the empirical distri-bution obtained from the simulation draws converges to the CDF of the draws (by the Glivenko–Cantelli

22

theorem) and the second term in (16) goes to zero because FN,S (β) can be seen exactly as our dis-crete approximation estimator. We can also bound the convergence rate of the first term in (16) byR(N)/

√S. Note that the term

∣∣∣ 1√S

∑Ss=1 {1 [βr,s ≤ β]− Φ(β | λr)}

∣∣∣ = OP∗(1) for each given λr by acentral limit theorem because βr,s are iid draws from Φ(β|λr) and E∗[1 [βr,s ≤ β]] = Φ(β|λr) for eachr, where E∗[·] denotes expectation and P∗ denotes probability measure with respect to the simulationdraw. It follows that

∑R(N)r=1 θrS

1√S

∣∣∣ 1√S

∑Ss=1 {1 [βr,s ≤ β]− Φ(β | λr)}

∣∣∣ ≤ C ·R(N)/√S w.p.a.1, which

is the bound for the first term in (16). As discussed above, Theorem 6 gives the convergence rate ofthe second term in (16), again because FN,S (β) can be seen exactly as our discrete approximationestimator.

Of course, the approach of nesting the mixture of continuous densities into the discrete approx-imation does not handle the full set of models allowed by the consistency theorem, Theorem 3. Inparticular, mixtures of normals are not allowed. Future work could explore more general results forthe rate of convergence of the estimator using a mixture of continuous densities.

8 Dynamic Programming Monte Carlo

In our Monte Carlo study, we apply our fixed grid estimators to a capital replacement model in thespirit of Rust (1987). A dynamic program must be solved for each grid point or simulation draw, soworking with a dynamic program showcases our method’s advantage of requiring computation of choiceprobabilities only before optimization commences. We vary the number of grid points, the sample sizesand the true distributions of the random coefficients. We compare the least squares and likelihoodcriteria on a fixed grid (fixed grid estimators) with an alternative likelihood criterion on a flexible grid(flexible grid estimator). For the fixed grid, we optimize over only weights. For the flexible grid, weoptimize over both the grid points and the weights, using a much smaller number of weights because ofthe computational challenges of both optimization and the need to solve the dynamic program manytimes. The two likelihood estimators are maximized using the EM algorithm. Our results suggestthat the two fixed grid estimators are much faster than the flexible grid estimator at the cost of somestatistical accuracy.

8.1 Heterogeneous Parameter Capital Replacement Model

We consider a capital replacement problem. A firm i uses one machine to produce output every periodt. Typically, the efficiency of the machine declines with time and the firm eventually finds it profitableto replace the machine. Each period, the firm can choose to do nothing or to replace the existingmachine with one of J − 1 different types of machines. Replacing the existing machine resets firm i’smachine’s age ai,t to 0. Denote the replacement decision as j ∈ {1, 2, . . . , J}, where j = 1 means noreplacement and j ≥ 2 means replacing the current machine with the new machine j and resetting theage ai,t to 0.

The observables (in addition to actions) to the researcher are the age ai,t, a scalar shifter of the flowprofits from running a machine di,t, the scalar current machine’s loss rate from age wi,t in the flow profitfrom running a machine, the vector of all J − 1 machines’ scalar loss rates from age vi = (v2,i, . . . , vJ,i)

and the matrix of shifters of the replacement costs of the J−1 machines ci = (c2,i, . . . , cJ,i), where each

23

cj,i is itself a vector of length Kc. The vector of efficiency loss rates vi and the matrix of replacementcost shifters ci are specific to each firm i and are not time-varying. The value of the scalar wi,t is oneof the elements of the vector vi, corresponding to the loss rate for the firm’s current machine. Collectthe observables (other than actions) as xi,t = (ai,t, di,t, wi,t, vi, ci). The time-varying, observable statevariables are si,t = (ai,t, di,t, wi,t).

The flow profits are also affected by a vector of firm-specific and time-invariant heterogeneousparameter βi and a vector of firm- and time-specific unobservables εi,t = (ε1,i,t, . . . , εJ,i,t). We discussthe components of the vector βi below. We seek to estimate the distribution G (β) of βi, or how theheterogeneous parameters vary across firms. Following the setup without time invariant heterogeneousparameters in Rust (1987), the time varying unobservables in εi,t are independent across firms, machinesand time and identically distributed according to the extreme value type I distribution, which giveslogit probabilities for the replacement decisions. As in Rust, the so-called integrated value functiondoes not need to include εi,t as a state variable because εi,t is independent over time.

The flow profit for action j is ui (j, si,t) + εj,i,t, where

ui (j, si,t) =

{βi,1 + βi,2w

−ai,ti,t + βi,3di,t, j = 1

−β′i,4cj,i j ≥ 2.

Here the heterogeneous parameters βi,1, βi,2 and βi,3 are scalars, the heterogeneous parameter βi,4 isa vector, and βi = (βi,1, βi,2, βi,3, βi,4) is a vector of length K = 3 + Kc. In addition to incorporatingshifters, the main content of the flow profit equation is that profit declines in machine age ai,t.

Recall that the time-varying state variables for firm i are si,t = (ai,t, di,t, wi,t). The scalar age ai,tof the machine evolves as

ai,t+1 (j, ai,t) =

{min {ai,t + 1, 5} , j = 1

0, j ≥ 2,

which means that the age resets to 0 after replacement and also stops increasing after an age of 5.The current efficiency loss rate wi,t corresponds to the efficiency loss of firm i’s current machine andso transitions as

wi,t =

{wi,t, j = 1

vj,i, j ≥ 2.

The shifter di,t has a finite support and evolves according to the Markov process Pr (di,t+1 | di,t), whichis independent of the firm’s replacement decision and other variables.

For this model, the time invariant terms that distinguish firm i from other firms are the vector ofheterogeneous parameters βi, the matrix ci, and the vector vi. Therefore, the dynamic program needsto be solved for each firm i. Following Rust (1987), the integrated value function and the conditional

24

choice probabilities, which both integrate out εi,t, are

Vi (si,t) = 0.577 . . .+ ln

J∑j=1

exp (ui (j, si,t) + δEVi (si,t+1 | si,t, j))

and

gj,i (si,t) =exp (ui (j, si,t) + δEVi (si,t+1 | si,t, j))∑J

j′=1 exp (ui (j′, si,t) + δEVi (si,t+1 | si,t, j′)),

where δ is the discount factor and EVi (si,t+1 | si,t, j) is the choice specific continuation payoffs function.The expectations are non-trivially taken with respect to only di,t because the transitions of (ai,t, wi,t)

are deterministic given actions. Each value function Vi is stored on the computer as a vector of lengthequal to the number of distinct values of si,t.

To write the dynamic programming choice probabilities in the notation of the general mixturemodel (1), let gj (xi,t, βi) = gj,i (si,t) be the probability of replacement action j given the observablesxi,t and the time invariant, heterogeneous parameters βi. We solve the value function, a system ofnonlinear equations, using value function iteration. Using the fixed grid estimation approach, there isone value function to solve for each combination of a grid value for βi and a value for the observable,time invariant firm characteristics (ci, vi).

8.2 Estimators

Our consistency and rate results in this paper are for cross-sectional data, not panel data. Therefore,we generate cross sectional data to test the estimators. Denote the replacement decision of firm i

as yi ∈ {1, . . . , J}. The data consists of (yi, xi) for i = 1, . . . N . Each firm is observed only onceand the states and actions are statistically independent across firms, conditional on observables. Forthe dimension of the vector of heterogeneous parameters on the cost of replacement shifters, β4, weconsider both Kc = 1 and 5. Given that K = 3 +Kc, we estimate models with either K = 4 or K = 8

total heterogeneous parameters. In other words, the unknown distribution F (β) is either a four- oreight-dimensional function.

We estimate using the same simulated data set using the least squares and likelihood fixed gridestimators and the likelihood flexible grid estimator. We use identical fixed grids for the least squaresand likelihood criteria. The flexible grid likelihood method estimates both the vector of grid pointsBR and the vector of weights

(θ1, . . . , θR

)on each grid point. We use the MATLAB constrained linear

least squares routine, lsqlin, for the least squares criterion. We follow the implementations of theEM algorithm for static logit models in Train (2008, Sections 4 and 6) for the fixed and flexible gridlikelihood estimators. The details are in Appendix D. Our convergence rule for the fixed grid EMalgorithm is that either the maximum absolute change in the weight is less than 10−9 or the numberof iterations is greater than 2000. In our Monte Carlo specifications, convergence is usually achievedwithin 100 iterations.

The fixed grid estimators are guaranteed to converge to the global optimum of the objective functionfrom any starting value, as our optimization problems are convex. The flexible grid estimator isguaranteed to converge to only a local maximum of the likelihood function.

We apply the fixed grid estimators to sample sizes of N = 100, 500, and 1000 observations on

25

independent firms. For each Monte Carlo choice of K, N , R and the number of mixture componentsM in the true data generating process (described below), we perform L = 100 replications of estimationusing different fake data sets. For each replication, we use the same dataset for all three estimators. Dueto the extremely long computational time of the flexible grid estimator for our dynamic programmingmodel, we only run the flexible grid estimator for the sample size N = 50 and the number of grid pointsR = 10. For similar computational reasons, the flexible grid estimator runs for only 10 iterations ofthe EM algorithm from one starting value, so its measured statistical error is an upper bound to theperformance of the flexible grid estimator run to convergence from many starting values.

Our dynamic programming model highlights the computational savings of fixed grid estimators. Fora grid vector βr, the value function also depends on the firm specific, time-persistent observables (ci, vi).We solve in total N ·R different dynamic programming problems when using a fixed grid estimator withR grid points. For each Monte Carlo replication, the dynamic programming computation is done onceand applied to both the least squares and likelihood criteria. In contrast, the flexible grid estimator ismuch more computationally intensive because it needs to solve many dynamic programming problemsevery time the optimizer calls choice probabilities. Choice probabilities are usually called thousands oftimes during optimization in a single EM iteration, resulting in far more dynamic programs than theN ·R from the fixed grid approaches.

8.3 Monte Carlo Data Generating Process

We now specify the data generating process for the Monte Carlo study. We set J = 6 so that thereare five machines. The discount factor δ is 0.95. Given the focus on testing the estimators on crosssectional data, we draw the heterogeneous parameters βi independently of xi. Therefore, we do notexplore statistical endogeneity, the so-called initial conditions problem in panel data.

The elements of xi have finite, continuous uniform or normal distributions.6 The elements of(ci, vi) have independent standard uniform distributions across machines. The profit shifter di,t has 50possible discrete values, drawn from a normal distribution and fixed across Monte Carlo specificationsand replications. We then generate the transition matrix with elements Pr (di,t+1 | di,t) by drawingthe elements from independent uniform distribution and normalizing

∑supp(di,t+1) Pr (di,t+1 | di,t) to

be 1. The terms Pr (di,t+1 | di,t) are also fixed across Monte Carlo specifications and replications. Thecurrent di,t for firm i is uniformly drawn from its support. The current age ai,t is uniformly drawnfrom the integers 0 to 5. Firm i’s wi,t is sampled uniformly from firm i’s vi.

The true distributions F0 (β) are mixtures of multivariate normals, each with nonzero correlations.We vary the number M of mixture components in the true distribution. There are either M = 1, 2 or5 mixture components. Each of the normal mixture components has an equal weight of 1/M .

6An identification result for the distribution of heterogeneous parameters in the random coefficients logit relies onexplanatory variables being continuous, as we discuss for Example 1 in Section 6. Some of the elements of xi havecontinuous distributions and others have discrete distributions, to simplify dynamic programming.

26

Define five sets of means for the maximum K of eight heterogeneous coefficients

µ1 = [0.375,−2, 2, 2, 0.875, 0.75, 1.25, 1.875]

µ2 = [0.25, 1,−1, 1.625, 2, 0.125, 1.25, 2]

µ3 = [0.375, 2,−2, 0.375, 1.75, 0.625, 0.25, 0.125]

µ4 = [0.5,−2,−2, 1.75, 0.75, 1.625, 0.875, 1.875]

µ5 = [0.25, 0,−1, 0.625, 0.375, 0.125, 0.125, 1.25] .

We use these terms for the data generating process for all specifications. Denote the first K elementsof the vector µm as µm (K). Say a particular Monte Carlo specification uses K = 4 and M = 2. Thenthe true distribution of β is a mixture of two normals, each with weight 1/2, and the mth componenthas mean µm (4).

Defining the covariance matrix of each mixture component is more involved. A base matrix is asymmetric matrix that is 4.3562 on the diagonal and 0.5252 elsewhere:

Σ? =

4.3562 0.5252 . . .

0.5252 4.3562 . . ....

.... . .︸︷︷︸

8 elements

.

Also denote the Kth leading principal submatrix of Σ? as Σ (K). In the data generating process, a

mixture of 1 ≤M ≤ 5 normals of dimension K is defined as F (β) =∑M

m=1

1

MN (µM (K) ,ΣK), where

the variance matrix of each component is

ΣK = Σ (K) +1

5

5∑m=1

(µm (K)− µ5 (K))′ (µm (K)− µ5 (K))

− 1

M

M∑m=1

(µm (K)− µM (K))′ (µm (K)− µM (K)) ,

where µM (K) = 1M

∑µm (K). In our specifications, this somewhat recursive choice of ΣK makes the

true distributions F (β) have the same variance matrix across choices of M , so there is not a tight linkbetween the number of modesM and the baseline grid’s coverage of the true distribution across MonteCarlo specifications.7

We use a quasi-random number sequence to generate a baseline grid for the two fixed grid estimators.We draw R = 100 or 200 values from a Halton sequence. The draws are low-discrepancy points inthe box [0, 1]K , and we map them into a larger region by multiplying each element by 15. We shiftthe coefficient βi,1 on the loss rate due to age by -1.5, the intercept βi,2 by -7.5, the coefficient βi,3 by

7Note that the variance matrix of∑Mm=1

1

MN (µM (K) ,ΣK) is ΣK+

1

M

∑Mm=1 (µm (K)− µM (K))′ (µm (K)− µM (K)).

With our choice of ΣK , the variance matrix becomes Σ (K) +1

5

∑5m=1 (µm (K)− µ5 (K))′ (µm (K)− µ5 (K)) for any

M .

27

-7.5, and the elements of the vector βi,4 by 1.5.8 We generate the grid for evaluation of measures likeRMISE, defined below, in the same way, except we draw 5000 points.

8.4 Results on Speed

The results of various Monte Carlo specifications are in Table 1 for the fixed grid least squares esti-mator, Table 2 for the fixed grid likelihood estimator, and Table 3 for a comparison between all threeestimators, including the flexible grid likelihood estimator. All times are for code parallelized over 12cores.

Focus first on the speed of the fixed grid estimators in Table 1 and 2. The tables break out the timespent on computing dynamic programs before optimization and the time spent on optimization / theEM algorithm. Table 1 shows that the overwhelming majority of time for the least squares estimatoris spent on solving the N ·R dynamic programs before optimization begins. In our largest sample sizeN = 1000 and largest estimation grid R = 200, the dynamic programming takes on average (acrossreplications) 1.7 hours. The optimization command for fixed grid least squares, lsqlin in MATLAB,consumes an insignificant amount of time, under 7 seconds in all the specifications considered. Table 2,for the fixed grid likelihood criterion, does not report dynamic programming time because the dynamicprograms reported in Table 1 are reused. The EM algorithm is even faster than the MATLAB commandlsqlin; optimization takes under 3 seconds for all specifications in Table 2. The bottom line is thatdynamic programming is far more costly than optimizing the statistical objective functions.

Table 3 compares all three estimators, including the flexible grid estimator, on a problem withN = 50, R = 10 and K = 4. The two fixed grid estimators require solving N · R = 500 dynamicprograms before optimization begins. For only 10 EM algorithm iterations from a single startingvalue, the flexible grid estimator requires solving on average (across the 100 Monte Carlo replications)over 200,000 dynamic programs for all specifications, as dynamic programs are nested inside eachcall to calculate choice probabilities. The average time of the 10 EM algorithm iterations for theflexible grid estimator is around 10,000 seconds, or around 2.7 hours. This compares to 12 seconds toconvergence (not just 10 iterations) for the fixed grid estimators, including the dynamic programmingtime. Altogether, the comparison in Table 3 shows the tremendous speed benefits of fixed grid overflexible grid methods.

For the flexible grid estimator, Table 3 reports only the outcome of 10 iterations from a singlestarting value in order to reduce the run time of the Monte Carlo. In serious empirical work, morestarting values should be used and the EM algorithm should be run for a variable number of iterations,until some notion of convergence is achieved.

8.5 Results on Statistical Accuracy

We measure estimation accuracy using root mean integrated squared error (RMISE) and integratedabsolute error (IAE). Denote the estimated distribution of β from one Monte Carlo replication l as Fland the true distribution as F0. The estimated CDF is evaluated on a grid consisting of Q = 5,000

points, so that the true CDF values of these points are smoothly spread from 0 to 1. The RMISE for8Some βi,2 are negative, meaning that the machine’s output actually improves with age.

28

a Monte Carlo specification with L replications is√√√√ 1

L

L∑l=1

1

Q

Q∑q=1

(Fl (βq)− F0 (βq)

)2.

The IAE for a given replication l is defined as

1

Q

Q∑q=1

∣∣∣∣Fl (βq)− F0 (βq)

∣∣∣∣.We report the mean, minimum and maximum IAE across replications for each estimator and MonteCarlo specification.

Table 1 and Table 2 report the RMISE and IAE of the least squares and likelihood fixed gridestimators, respectively. The two estimators take as input the same grid and corresponding choiceprobabilities for each grid point and firm; they differ only in the statistical criterion. The tables showthat both the RMISE and IAE of the least squares estimator tend to be lower than the RMISE andIAE of the fixed grid likelihood estimator. At least in our setup, the least squares criterion seemspreferable because of its better statistical performance and the fact that its slower optimization speedis still trivial compared to the time spent on dynamic programming.

Tables 1 and 2 compare true data generating processes with M = 1, 2 and 5 multivariate normalcomponents. Both RMISE and IAE increase with the number of modes in the true distributiongenerating the data, although the amounts of the increase in RMISE and IAE are not large. Itappears to be harder to nonparametrically estimate distributions with more modes.

Tables 1 and 2 also vary K, the dimension of the heterogeneous parameters β. K is either 4 or8. Not surprisingly, RMISE and mean IAE tend to increase with K. The increase is largest for thecase of M = 1, where the true distribution is a single component multivariate normal. For higher M ,changing K from 4 to 8 increases in RMISE and IAE by only a small amount. Compared to M = 1,the difference is poorer performance with K = 4, as the statistical performance of the K = 8 case isless sensitive to M . While there is some evidence of estimation error substantially increasing with Kfor M = 1, for other M the Monte Carlo study does not demonstrate the curse of dimensionality in Kfound in the upper bound of the convergence rate of the CDF supremum estimation error in Theorem6.

Tables 1 and 2 also shed light on how grid size R affects statistical accuracy. WithK = 4, increasingthe grid size from 100 to 200 reduces RMISE and IAE. The reduction is less perceptible with higherdimensional heterogeneous parameters, K = 8.

Tables 1 and 2 also report the mean (across replications) number of the R grid points whose weightsθr are estimated to be nonzero (greater than 0.001). For both fixed grid estimators, it is common tohave around 10 nonzero weights, out of a grid of either R = 100 or R = 200 points. Both estimatorsfavor sparse representations of the true distribution without including any sort of LASSO-style penalty.The number of nonzero weights varies across replications but there is no strong trend favoring morenonzero weights for either the least squares or fixed grid likelihood estimators. Both Tables 1 and 2compare specifications by varying K and M . Not surprisingly, higher M and K tend to lead to more

29

nonzero weights, but the effect is small relative to the grid sizes of R = 100 and R = 200.Tables 1 and 2 report how sample size N affects accuracy. Across sample sizes of N = 100, 500

and 1000, we see only tiny reductions in RMISE and mean IAE with N . This is consistent with anestimator that converges slowly, as Theorem 6 indicates.

Table 3 compares all three estimators on a small sample N = 50 with a small grid R = 10. Dueto the flexible estimator’s prohibitively long computational time, we iterate the EM algorithm for theflexible grid estimator ten times from one starting point. We use K = 4 and vary M to be 1, 2 or5, as in Tables 1 and 2. Although the flexible grid estimator has not converged and we only use onestarting value, its RMISE and IAE are substantially lower than the fixed grid results. Table 3 showsthat there is a large statistical advantage of being able to estimate a flexible grid instead of fixing a gridof heterogeneous parameters. Bringing in the run times in Table 3, there is a clear tradeoff betweenthe prohibitive speed of the flexible grid estimator and the poorer statistical performances of the fixedgrid estimators.

9 Conclusion

We analyze a nonparametric approach for estimating general mixtures models. We fix the grid ofvectors of heterogeneous parameters before estimation. Our approach allows the researcher to dropstandard parametric assumptions, such as independent normal heterogeneous parameters, that arecommonly used in applied work. We consider both a least squares (linear probability model) criterionand a likelihood criterion. Convergence of an optimization routine to the global optimum is guaranteedfor least squares and the likelihood is globally concave, something that cannot be said for statisticalobjective functions where the grid of vectors of heterogeneous parameters is to be optimized over.Our approach is useful for reducing the number of times complex structural models, such as dynamicprograms, need to be evaluated in estimation. A Monte Carlo study shows that our estimators are muchfaster than a likelihood estimator where one optimizes over the grid of heterogeneous parameters for adynamic programming example. However, the flexible grid estimator has better statistical accuracy.

We explore the asymptotic properties of the nonparametric distribution estimators. We showconsistency in the function space of all distributions under the weak topology by viewing our estimatorsas sieve estimators and verifying high-level conditions in Chen and Pouzo (2012). We also showconsistency for i) a model with homogenous parameters and ii) an estimator based on a mixture oversmooth basis distribution functions on possibly unbounded supports. We verify some of the conditionsfor consistency for three example discrete choice models, each of which is widely used in empiricalwork. We also derive the convergence rates of the least squares nonparametric estimator for a classof differentiable models, for which we can derive the approximation error rates of our nonparametricsieve space.

30

A Well-posedness

Chen (2007) and Carrasco, Florens and Renault (2007) distinguish between functional (here the distri-bution) optimization and identification problems that are well-posed and problems that are ill-posed.Using Chen’s definition, the optimization problem of maximizing the population criterion functionQ (F ) with respect to the distribution function F will be well-posed if dLP(Fn, F0) → 0 for all se-quences {Fn} in F such that Q(Fn) − Q(F0) → 0. The problem will be ill-posed if there exists asequence {Fn} in F such that Q(Fn) − Q(F0) → 0 but dLP(Fn, F0) 9 0.9 We now argue that ourproblem is well-posed.

Note that F is compact in the weak topology (Assumption 1.1). Also, Q (F ) is continuous on F byAssumption 1.5. It follows that with our choice of the criterion function and metric, our optimizationproblem is well posed in the sense of Chen (2007) because for every ε > 0 we have

infF∈FR(N):dLP(F,F0)≥ε

(Q(F )−Q(F0)) ≥ infF∈F :dLP(F,F0)≥ε

(Q(F )−Q(F0)) > 0, (17)

where the first inequality holds because FR(N) ⊂ F by construction and the second, strict inequalityholds as the minimum is attained by continuity and compactness and because the model is identified(Assumption 1.4), as we argue in the proof of Theorem 1. Therefore, our optimization problem satisfiesChen’s definition of well-posedness.

B Proofs of Consistency

B.1 Proofs of Theorems 1 and 2

We provide proof of the consistency theorem for models with homogenous parameters because theessentially same proof can be applied to the models without homogeneous parameters.

We verify the conditions of CP’s Lemma A.2 (also see Theorem 3.1 in Chen (2007)) in our con-sistency proof. To provide completeness, we first present our simplified version of CP’s Lemma A.2,which does not incorporate a penalty function. We let α = (γ, F ) ∈ A ≡ Γ× F , AR(N) ≡ Γ× FR(N),and with possible abuse of notation dLP(α1, α2) ≡ ‖γ1 − γ2‖E + dLP(F1, F2) for α1, α2 ∈ A. DefineQN (α) = 1

NJ

∑Ni=1

∥∥yi − ∫ g (xi, β, γ) dF∥∥2

Eand Q(α) ≡ E

[∥∥y − ∫ g (x, β, γ) dF∥∥2

E/J].

Lemma 2. Lemma A.2 of CP: Let αN = (γ, FN ) be such that QN (αN ) ≤ infα∈AR(N)QN (α) +

Op(νN ) with νN → 0. Suppose the following conditions (B.2.1)–(B.2.4) hold:

• (B.2.1) (i) Q(α0) < ∞; (ii) there is a positive function δ(N,R(N), ε) such that for each N ≥ 1,R ≥ 1, and ε > 0, infα∈AR(N):dLP (α,α0)≥εQ(α)−Q(α0) ≥ δ (N,R(N), ε) and lim infN→∞ δ (N,R(N), ε) ≥0 for all ε > 0.

• (B.2.2) (i) (A, dLP(·)) is a metric space; (ii) AR ⊆ AR+1 ⊆ A for all R ≥ 1, and there exists asequence ΠNα0 ∈ αR(N) such that dLP (ΠNα0, α0)→ 0.

9Whether the problem is well-posed or ill-posed also depends on the choice of the metric. For example, if one usesthe total variation distance metric instead of the the Lévy-Prokhorov metric, the problem will be ill-posed because thedistance between a continuous distribution and any discrete distribution will always be equal to one in the total variationmetric.

31

• (B.2.3) (i) QN (α) is a measurable function of the data ((yi, xi))Ni=1 for all α ∈ AR(N); (ii) αN is

well-defined and measurable with respect to the Borel σ-field generated by the topology inducedby the metric dLP(·, ·).

• (B.2.4) (i) Let cQ(R(N)) = supα∈AR(N)

∣∣∣QN (α)−Q(α)∣∣∣ p→ 0;

(ii) max{cQ(R(N)), νN , |Q (ΠNα0)−Q (α0)|

}/δ (N,R(N), ε)

p→ 0 for all ε > 0.

Then dLP(αN , α0)p→ 0.

Using Lemma 2 we now provide our consistency proof for the least squares estimator. Becauseour estimator is an extremum estimator, we can take νN to be arbitrarily small. We start with thecondition (B.2.1). The condition Q(α0) < ∞ holds because Q(α) ≤ 1 for all α ∈ A, because we havea linear probability model. Next we will verify the condition

infa∈AR(N):dLP(α,α0)≥ε

Q(α)−Q(α0) ≥ δ(N,R(N), ε) > 0 (18)

for each N ≥ 1, R(N) ≥ 1, ε > 0, and some positive function δ(N,R(N), ε) to be defined below. Wewill use our assumption of identification (Assumption 1.4). Let m (x, α) = P (x, α0)− P (x, α), whereP (x, α) =

∫gj (x, β, γ) dF (β). Note that we have

Q(α) = E[||y − P (x, α0) +m(x, α)||2E/J

]= E

[||y − P (x, α0)||2E/J

]+ E

[||m(x, α)||2E/J

](19)

because E[(y − P (x, α0))′m(x, α)] = 0 by the law of iterated expectation and E[y − P (x, α0) | x] = 0.Therefore, for each α ∈ A, we have

Q(α)−Q(α0) = E[||m(x, α)||2E/J

]− E

[||m(x, α0)||2E/J

]= E

[||m(x, α)||2E/J

](20)

because m(x, α0) = 0. The condition (18) now holds due to our assumption of identification, as thefollowing argument shows.

Consider E[||m (x, α) ||2E ], with m (x, α) defined above, as a map from A to R+ ∪ {0}. For anyα 6= α0, E[||m (x, α) ||2E ] takes on positive values for each α ∈ A, because the model is identified on aset X with positive probability. Then note that E[||m (x, α) ||2E ] is continuous in α and also note thatAR(N) is compact. Therefore E[||m (x, α) ||2E ] attains some strictly positive minimum on {α ∈ AR(N) :

dLP(α, α0) ≥ ε}. Then we can take δ(N,R(N), ε) = infα∈AR(N):dLP(α,α0)≥εE[||m(x, α)||2E/J

]> 0 for

all R(N) ≥ 1 with ε > 0.

We have shown that δ(N,R(N), ε) > 0 for all N and hence lim infN→∞ δ(N,R(N), ε) ≥ 0. This isenough for (B.2.1). However, under our assumptions indeed lim infN→∞ δ(N,R(N), ε) > 0 because

δ(N,R(N), ε) = infα∈AR(N):dLP(α,α0)≥ε

(Q(α)−Q(α0)) ≥ infα∈A:dLP(α,α0)≥ε

(Q(α)−Q(α0))

= infα∈A:dLP(α,α0)≥ε

E[||m (x, α) ||2E/J ] > 0,

where the first inequality holds because AR(N) ⊆ A by construction and the second, strict inequalityholds because the model is identified (Assumption 1.4). Here, lim infN→∞ δ(N,R(N), ε) > 0 holdsbecause the last term in the right-hand side of the above inequality does not depend on N and the

32

term is strictly positive.10 This result makes it convenient to verify (B.2.4)(ii) below.Next we consider (B.2.2). First note that (A, dLP) is a metric space and we have AR ⊆ AR+1 ⊆ A

for all R ≥ 1 by construction of our sieve space. Then we claim that there exists a sequence offunctions ΠNα0 ∈ AR(N) such that dLP(ΠNα0, α0)→ 0 as N →∞. First, BR(N) becomes dense in Bby assumption. Second, AR(N) becomes dense in A because the set of distributions FR(N) on a densesubset BR(N) ⊂ B is itself dense. To see this, remember that the class of all distributions with finitesupport is dense in the class of all distributions (Aliprantis and Border 2006, Theorem 15.10). Anydistribution with finite support can be approximated using a finite support in a dense subset BR(N)

(Huber 2004).Next we consider (B.2.3). As Theorem 1 is a special case of Theorem 2, we focus explicitly on

Theorem 2. Consider first Assumption 2.5.a. Remark A.1.(1) (a) of CP says that (B.2.3) holds if eachsieve space AR is compact and the finite-sample objective function is lower semicontinuous in (γ, F ).First note that FR is a compact subset of F for each R because BR is a compact subset of B andhence AR is also a compact subset of A.11 Second we show that for any data ((yi, xi))

Ni=1, QN (α)

is continuous on AR for each R ≥ 1. Since AR is compact, this continuity means our estimator iswell defined as the minimum in (6). Because checking the continuity in γ is trivial for given models(e.g. logit model), we focus on the continuity on FR. For any F1, F2 ∈ FR(N), applying the triangleinequality, we obtain∣∣∣QN (γ, F1)− QN (γ, F2)

∣∣∣ ≤ 2∑N

i=1

∑J

j=1yi,j

∣∣∣∣∫ gj(xi, β, γ)(dF1 − dF2)

∣∣∣∣ /NJ (21)

+∑N

i=1

∑J

j=1

{∫gj(xi, β, γ)(dF1 + dF2)

} ∣∣∣∣∫ gj(xi, β, γ)(dF1 − dF2)

∣∣∣∣ /NJ≤ 4

∑N

i=1

∑J

j=1

∣∣∣∣∫ gj(xi, β, γ)(dF1 − dF2)

∣∣∣∣ /NJ,where the second inequality holds because yi,j , gj(xi, β, γ), and

∫gj(xi, β, γ)dF (β) are uniformly

bounded by 1 for all j and xi. Then because gj(xi, β, γ) is uniformly bounded by 1 and F1 andF2 are discrete distributions with the finite support BR, in this case the weak convergence implies thatalmost surely QN (γ, F ) is continuous on FR, i.e. for any F1, F2 ∈ FR such that dLP (F1, F2) → 0, itfollows that |QN (γ, F1)− QN (γ, F2)| → 0 almost surely.12 Therefore (B.2.3) holds by Remark A.1.(1)(a) of CP.

10As we discussed in the main text the space F of distributions on B is compact in the weak topology because weassume B itself is compact.

11Alternatively we can also see that FR is compact because the simplex, ∆R(N), itself is compact as we argue below.For any given R and BR, consider two metric spaces, (FR, dLP) and (∆R, || · ||E). Then we can define a continuous mapψ : ∆R → FR because any element in ∆R determines an element in FR. The map is continuous in the sense that forany sequence θn → θ in ∆R we have ψ(θn)→ ψ(θ) in FR. Then it is a simple proof to show that if ∆R is compact, thenFR = {ψ(θ) : θ ∈ ∆R} is also compact.

Proof. Consider an arbitrary sequence {Fn}n∈N ⊆ FR. Since Fn ∈ {ψ(θ) : θ ∈ ∆R} for all n ∈ N, we know that thereexists θn ∈ ∆R with ψ(θn) = Fn for all n ∈ N. Then {θn}n∈N ⊆ ∆R. Next note that since ∆R is compact, there existssome subsequence {θln}n∈N with θln → θ ∈ ∆R. Since the map ψ is continuous, it follows that ψ(θln) → ψ(θ). Andbecause ψ(θln) = Fln , then Fln → ψ(θ) ∈ FR because θ ∈ ∆R. Therefore, we conclude FR is also compact when ∆R iscompact.

12Note that∣∣∫ gj(xi, β)(dF1 − dF2)

∣∣ ≤∑Rr=1 |θ

r1 − θr2| → 0 as dLP (F1, F2)→ 0 for any finite R.

33

For the case of Assumption 2.5.b, we directly assume (B.2.3) for the reasons stated in Remark 8.Next there are two conditions to verify in (B.2.4). We first focus on the uniform convergence of the

sample criterion function in (B.2.4). Here we need to verify the uniform convergence over AR(N) suchthat

sup(γ,F )∈Γ×FR(N)

∣∣∣QN (γ, F )−Q(γ, F )∣∣∣ p→ 0. (22)

For this purpose, it is convenient to view QN (γ, F ) and Q(γ, F ) as functions of γ and θ ∈ ∆R(N)

and write them as QN (γ, θ) and Q(γ, θ), respectively. Then define, for any R, the class of measurablefunctions

GR =

l(y, x, θ, γ) =

∥∥∥∥∥y −∑r

θrg(x, βr, γ)

∥∥∥∥∥2

E

/J : (γ, θ) ∈ Γ×∆R

and note that QN (γ, θ) = N−1

∑Ni=1 l(yi, xi, θ, γ). Then again by Pollard (1984, Theorem II.24), the

uniform convergence (22) holds if and only if the entropy satisfies logN(ε, GR, || · ||L1,N ) = op(N) forall ε > 0. Note that the entropy measure of GR is bounded by the sum of two entropies, one associatedwith FR(N) and the other one associated with Γ. Below we show that the former is op(N). We alsonote that the latter satisfies the entropy condition (and so is op(N)) under Assumption 2.5a-2.5b byTheorem 2.7.11 of van der Vaart and Wellner (1996) for the Lipschitz case and because the class ofindicator functions belongs to the Vapnik-Červonenkis class and has a uniformly bounded entropy(Theorem 2.6.7 of van der Vaart and Wellner 1996).

Now we verify the entropy condition associated with FR(N). Let ∆R(N) be the R (N) unit simplex.Using measures of complexity of spaces, let N(ε, T , || · ||) denote the covering number of the set T withballs of radius ε with an arbitrary norm || · || and let N[](ε, T , ‖·‖) denote the bracketing number ofthe set T with ε-brackets. For ease of notation, below we suppress γ (the result holds for any givenγ ∈ Γ) and define for any R, the class of measurable functions

GR =

{l(y, x, θ) = ||y −

∑r

θrg(x, βr)||2E/J : θ ∈ ∆R

}. (23)

Note (i) QN (θ) = N−1∑N

i=1 l(yi, xi, θ), (ii) {(yi, xi)}Ni=1 are i.i.d., and (iii) E[supθ∈∆R(N)

|l(y, x, θ)|] ≤1 <∞. Then by Pollard (1984, Theorem II.24) (also see Chen (2007, Section 3.1, page 5592) for relateddiscussion), the entropy condition to satisfy becomes logN(ε,GR, || · ||L1,N )/N

p→ 0 for all ε > 0, where|| · ||L1,N denotes the L1(PN )-norm and PN denotes the empirical measure of the data ((yi, xi))

Ni=1.

The term l(y, x, θ) is Lipschitz in θ, as

|l(y, x, θ1)− l(y, x, θ2)| ≤ 1

J

J∑j=1

(2yj∑r

gj(x, βr)|θr1 − θr2|+

∑r

gj(x, βr)(θr1 + θr2)

∑r

gj(x, βr)|θr1 − θr2|

)

≤ M(·)∑R

r=1|θr1 − θr2| ≤M(·)

√R||θ1 − θ2||E

with some function E[M(·)2] < ∞. The first inequality is obtained by the triangle inequality andthe third inequality holds due to the Cauchy-Schwarz inequality. We also know ∆R is a compactsubset of RR. Now take M(·) = 4, noting that yj , gj(·), and

∑Rr=1 gj(x, β

r)θr are uniformly bounded

34

by 1. Then from Theorem 2.7.11 of van der Vaart and Wellner (1996), we have N[] (2ε,GR, ‖·‖) ≤

N(

ε4√R,∆R, ‖·‖E

)=(

4√Rε

)Rfor any norm ‖·‖. Therefore as long as R(N) logR(N)/N → 0, the

entropy condition associated with FR(N) holds because N(ε,GR, ‖·‖L1,N

)≤ N[]

(2ε,GR, ‖·‖L1,N

)≤(

4√Rε

)R(van der Vaart and Wellner 1996, page 84).

To satisfy the second condition in (B.2.4), we need to bound all three terms in the max{·} func-tion. We have shown the uniform convergence of the sample criterion function (this also satisfies(B.2.4) (i)) and we can take νN to be small enough. We also have |Q(ΠNα0) − Q(α0)| → 0, whichis trivially satisfied by the continuity of Q(α) at α0 and dLP (ΠNα0, α0) → 0. Therefore becauselim infN→∞ δ (N,R(N), ε) > 0, the condition (B.2.4) (ii) is satisfied.

We have verified all the conditions in Lemma 2 (Lemma A.2 of CP) and this completes the consis-tency proof.

B.2 Proof of Corollary 1

Similarly to the baseline least squares estimator, using Lemma 2 we show the consistency of the MLestimator. For ease of notation, we consider the model with heterogeneous parameters only. Thefollowing proof is parallel to the least squares case.

We start with (B.2.1). First, the condition Q(F0) <∞ holds under the assumption that Pj(x, F0)

is bounded away from zero for all j ≤ J . Similarly to the least squares case, next we show that forε > 0

lim infF∈FR(N):dLP(F,F0)≥ε

Q(F )−Q(F0) > 0. (24)

Note that for any F 6= F0, we have

Q(F )−Q(F0) = −

E J∑j=1

yj logPj(x, F )

− E J∑j=1

yj logPj(x, F0)

(25)

= −E

log

J∏j=1

Pj(x, F )/Pj(x, F0)yj

> − log

E J∏j=1

(Pj(x, F )/Pj(x, F0)) yj

= − log

E J∑j=1

Pj(x, F )

= 0,

where the inequality holds by Jensen’s inequality. Here the strict inequality holds because P (x, F0) 6=P (x, F ) with positive probability for any F 6= F0 (Assumption 1.4). The third equality holds by the lawof iterated expectation and Pr[yj = 1|x] = Pj(x, F0). The last result holds because

∑Jj=1 Pj(x, F ) = 1

by construction. Therefore, (24) is satisfied because (25) holds for any F ∈ F such that dLP(F, F0) ≥ ε,by essentially the same argument for the least squares case.

Next, (B.2.2) holds by the same argument for the proof of the least squares estimator.Next, we show (B.2.3) holds. We use Remark A.1.(i)(a) of CP. First note that FR is a compact

subset of F for each R because BR is a compact subset of B. Second we need to show for any data

35

((yi, xi))Ni=1, QN (F ) is continuous on FR for each R ≥ 1. Using the inequality that for 0 < c ≤ a, b,

| log a− log b| ≤ |a− b|/c, we obtain for some constant C

∣∣∣QN (F1)− QN (F2)∣∣∣ ≤ ∑N

i=1

J∑j=1

yi,j |logPj(xi, F1)− logPj(xi, F2)| /N

≤ C∑N

i=1

J∑j=1

yi,j |Pj(xi, F1)− Pj(xi, F2)| /N

= C∑N

i=1

J∑j=1

yi,j

∣∣∣∣∫ gj(xi, β)(dF1 − dF2)

∣∣∣∣ /Nand therefore (B.2.3) holds by the essentially same argument to the proof of the least squares case (seethe discussion below (21)).

Next there are two conditions to verify in (B.2.4). We first focus on the uniform convergence of thesample criterion function. Here we need to verify the uniform convergence over FR(N) such that

supF∈FR(N)

∣∣∣QN (F )−Q(F )∣∣∣ p→ 0. (26)

For this purpose, it is convenient to view QN (F ) and Q(F ) as functions of θ ∈ ∆R(N) and write themas QN (θ) and Q(θ), respectively. Then define, for any R, the class of measurable functions

GMLR =

l(y, x, θ) =

J∑j=1

yj log

(∑r

θrgj(x, βr)

): θ ∈ ∆R

. (27)

Note (i) QN (θ) = −N−1∑N

i=1 l(yi, xi, θ), (ii) ((yi, xi))Ni=1 are i.i.d., and (iii)E

[supθ∈∆R(N)

|l(y, x, θ)|]<

∞. Then by Pollard (1984, Theorem II.24) (also see Chen (2007, Section 3.1, page 5592) for relateddiscussion), the entropy condition becomes logN(ε,GML

R , || · ||L1,N )/Np→ 0 for all ε > 0. Using the

inequality that for 0 < c ≤ a, b, | log a− log b| ≤ |a− b|/c, we obtain for some constant C

|l(y, x, θ1)− l(y, x, θ2)| =

∣∣∣∣∣∣J∑j=1

yj

{log

(∑r

θr1gj(x, βr)

)− log

(∑r

θr2gj(x, βr)

)}∣∣∣∣∣∣≤ C

J∑j=1

yj

∣∣∣∣∣∑r

θr1gj(x, βr)−

∑r

θr2gj(x, βr)

∣∣∣∣∣≤ C

J∑j=1

yj

{∑r

gj(x, βr) |θr1 − θr2|

}≤M(·)

∑R

r=1|θr1 − θr2| ≤M(·)

√R||θ1 − θ2||E

with some function E[M(·)2] < ∞. Therefore, the first condition in (B.2.4) holds by the essentiallysame argument to the proof of the least squares case. Finally, the second condition in (B.2.4) alsoholds by the essentially same argument to the proof of the least squares case. We have verified all theconditions in Lemma 2 (Lemma A.2 of CP) for the ML estimator.

36

B.3 Proof of Theorem 3

We can prove Theorem 3 by verifying conditions (B.2.1)–(B.2.4) in Lemma 2, just as in the proof ofTheorem 1. Observe that (B.2.1)–(B.2.2) are clearly satisfied because the arguments that (B.2.1)–(B.2.2) are satisfied within the proof of Theorem 1 follow almost exactly if we replace g(x, βr), FR, F ,and Q(F ) with g(x, λr), Pλ,R, Pλ, and Q(Pλ). The verification of (B.2.3) also follows the argumentsin the proof of Theorem 1, instead using QN (Pλ) and ΛR. Condition (B.2.4), regarding the uniformconvergence of QN (Pλ) to Q(Pλ), is satisfied by invoking the same arguments as in the proof Theorem1, using QN (Pλ) and Q(Pλ). Other arguments are also essentially identical to Theorem 1. ThereforedLP

(Pλ,N , Pλ,0

)p→ 0, which also implies dLP

(FN , F0

)p→ 0 because we assume F0 ∈ FM and because

F =∫

Φ(·|λ)dPλ(dλ) is continuous on Pλ in the Lévy-Prokhorov metric. We omit a complete proof ofTheorem 3 to eliminate redundancy.

B.4 Proof of Theorem 4

Define gS(x, λr) = 1S

∑Ss=1 g(x, βr,s) where βr,s are drawn from Φ(β|λr) and define QN,S(Pλ) =

N−1∑N

i=1

∥∥yi −∑r θrgS(xi, λ

r)∥∥2

E/J . Then define the estimator of Pλ,0 as

Pλ,N,S = argminPλ∈Pλ,R(N)QN,S(Pλ) + C · νN (28)

with some tolerance of minimization, C · νN , that tends to zero (if this is necessary).We prove the consistency by verifying conditions in Lemma 2 for the simulated distribution func-

tion estimator. Note that (B.2.1)–(B.2.2) are clearly satisfied because the proofs of (B.2.1)-(B.2.2)in Theorem 1 hold by replacing g(x, βr), FR , F , and Q(F ) with gS(x, λr), Pλ,R, Pλ, and Q(Pλ),respectively, when necessary. We focus on (B.2.3) and (B.2.4).

To show (B.2.3) holds, we use Remark A.1.(i)(a) of CP. First note that Pλ,R is a compact subsetof Pλ for each R because ΛR is a compact subset of Λ. Second we need to show that for any data((yi, xi))

Ni=1, QN,S(Pλ) is continuous on Pλ,R for each R ≥ 1. Since Pλ,R is compact, this continuity

means our estimator is well defined as the minimum in (28). Note∫gS (x, λ) dPλ,l =

∑Rr=1 θ

rl gS (x, λr)

for Pλ,l ∈ Pλ,R, l = 1, 2. Then, for any Pλ,1, Pλ,2 ∈ Pλ,R(N), applying the triangle inequality, we obtain

∣∣∣QN,S(Pλ,1)− QN,S(Pλ,2)∣∣∣ ≤ 2

∑N

i=1

∑J

j=1yi,j

∣∣∣∣∫ gSj (xi, λ)(dPλ,1 − dPλ,2)

∣∣∣∣ /NJ+

∑N

i=1

∑J

j=1

{∫gSj (xi, λ)(dPλ,1 + dPλ,2)

} ∣∣∣∣∫ gSj (xi, λ)(dPλ,1 − dPλ,2)

∣∣∣∣ /NJ≤ 4

∑N

i=1

∑J

j=1

∣∣∣∣∫ gSj (xi, λ)(dPλ,1 − dPλ,2)

∣∣∣∣ /NJ,where the second inequality holds because yi,j , gSj (xi, λ), and

∫gSj (xi, λ)dPλ are uniformly bounded

by 1 for all j and xi. Then because gSj (xi, λ) is uniformly bounded by 1 and Pλ,1 and Pλ,2 are discretedistributions with the finite support ΛR, weak convergence implies that almost surely QN,S(Pλ) iscontinuous on Pλ,R, i.e. for any Pλ,1, Pλ,2 ∈ Pλ,R such that dLP (Pλ,1, Pλ,2) → 0, it follows that|QN,S(Pλ,1) − QN,S(Pλ,2)| → 0 almost surely. Therefore, by Remark A.1.(i) (a) of CP, (B.2.3) holdswith simulated basis functions with any S.

37

Next we verify (B.2.4). (B.2.4) (ii) is clearly satisfied as in Theorem 1. We focus on (B.2.4) (i): theuniform convergence of QN,S(Pλ) to Q(Pλ) for Pλ ∈ Pλ,R(N), supPλ∈Pλ,R(N)

|QN,S(Pλ)−Q(Pλ)| p→ 0. Asin Theorem 1 it is convenient to view QN,S(Pλ) and Q(Pλ) for Pλ ∈ Pλ,R(N) as functions of θ ∈ ∆R(N)

and then write them as QN,S(θ) and Q(θ), respectively. Then the uniform convergence condition toverify becomes

supθ∈∆R(N)

∣∣∣QN,S(θ)−Q(θ)∣∣∣ p→ 0. (29)

Define, for any S, QS(θ) ≡ E[∥∥yi −∑r θ

rgS (xi, λr)∥∥2

E/J], where the expectation is taken with

respect to (yi, xi) while fixing the simulation draws in gS (xi, λr). We first show that

supθ∈∆R(N)

∣∣∣QN,S(θ)−QS(θ)∣∣∣ p→ 0.

Later we show supθ∈∆R(N)|QS(θ)−Q(θ)| p→ 0; therefore invoking the triangle inequality we obtain

(29).For any R define the class of measurable functions

GR,S =

l(y, x, θ) =

∥∥∥∥∥y −∑r

θrgS(x, λr)

∥∥∥∥∥2

E

/J : θ ∈ ∆R

.

Note (i) QN,S(θ) = N−1∑N

i=1 l(yi, xi, θ), (ii) {(yi, xi)}Ni=1 are i.i.d., and (iii) E[supθ∈∆R(N)

|l(y, x, θ)|] ≤1 < ∞. Then by Pollard (1984, Theorem II.24) (also see Chen (2007, Section 3.1, page 5592) forrelated discussion), the uniform convergence (29) holds if and only if logN(ε,GR,S , || · ||L1,N )/N

p→ 0

for all ε > 0, where || · ||L1,N denotes the L1(PN )-norm and PN denotes the empirical measure of thedata {(yi, xi)}Ni=1. Then following the same steps in the proof of Theorem 1, we conclude as long asR(N) logR(N)/N → 0, the uniform convergence of QN,S(θ) to QS(θ) holds for any given S.

Next we show supθ∈∆R(N)|QS(θ) − Q(θ)| p→ 0 in P∗ where P∗ denotes probability measure w.r.t.

38

the simulation draws. Consider that

|QS(θ)−Q(θ)|

= E

∥∥∥∥∥yi −∑r

θrgS(xi, λr)

∥∥∥∥∥2

E

−

∥∥∥∥∥yi −∑r

θrg(xi, λr)

∥∥∥∥∥2

E

≤ J−1

J∑j=1

E

∣∣∣∣∣∣(yi,j −

∑r

θrgSj (xi, λr)

)2

−

(yi,j −

∑r

θrgj(xi, λr)

)2∣∣∣∣∣∣

≤ J−1J∑j=1

E

[∣∣∣∣∣(

2yi,j −∑r

θrgSj (xi, λr)−

∑r

θrgj(xi, λr)

)(∑r

θrgSj (xi, λr)−

∑r

θrgj(xi, λr)

)∣∣∣∣∣]

≤ 2J−1J∑j=1

E

[∣∣∣∣∣∑r

θrgSj (xi, λr)−

∑r

θrgj(xi, λr)

∣∣∣∣∣]

≤ 2J−1J∑j=1

∑r

θrE[∣∣gSj (xi, λ

r)− gj(xi, λr)∣∣]

≤ 2J−1J∑j=1

∑r

θr maxrE

[∣∣∣∣∣ 1SS∑s=1

{gj(xi, β

r,s)−∫gj(xi, β)dΦ(β|λr)

}∣∣∣∣∣]

= oP∗(1),

where the third inequality holds because 0 <∑

r θrgSj (xi, λ

r) ≤ 1 and 0 <∑

r θrgj(xi, λ

r) ≤ 1 andthe last result holds because

∑Rr=1 θ

r = 1 and βr,s are iid draws from Φ(β|λr), so E∗[gj(x, βr,s)] =∫gj(x, β)dΦ(β|λr) < ∞ for each r and we applied a LLN to the term inside the | · | bracket for each

r, which holds for any x ∈ X . Note that the convergence of the partial sum inside the | · | bracket isindependent of r because we use S simulation draws independently drawn for each λr.

Here E∗[·] denotes expectation w.r.t. the simulation draw. Note that the above oP∗(1) result doesnot depend on θ and therefore we also have supθ∈∆R(N)

|QS(θ)−Q(θ)| p→ 0. Then invoking the triangleinequality we obtain the uniform convergence in (29).

We have verified all the conditions in Lemma 2 (Lemma A.2 of CP) and therefore showed that Pλ,N,Sis a consistent estimator for Pλ,0. This in turn implies the consistency of FN,S =

∫Φ(·|λ)Pλ,N,S(dλ)

to F0 =∫

Φ(·|λ)Pλ,0(dλ) because Φ(β|λ) is continuous in λ for all β and thus F =∫

Φ(·|λ)dPλ(dλ) isalso continuous on Pλ in the Lévy-Prokhorov metric.

C Proofs for Asymptotic Bounds Results

We let C, C1, C2, . . . denote generic positive constants. We use diag (A) to denote a diagonal matrixcomposed of diagonal elements of a matrix A. We often use the following inequality (denoted by RCS

for the Cauchy-Schwarz inequality for R terms in a sum):∑R

r=1Wr ≤√R√∑R

r=1W2r for a sequence

Wr’s. We note ‖P (x, F0)‖∞ ≤ ζ0 and ‖g (x, βr)‖∞ ≤ ζ0 for some constant ζ0 > 0 uniformly overr ≤ R(N) because probabilities are bounded by one. We also note for some c0 > 0, ‖g(x, βr)‖L2

≥ c0

uniformly over r ≤ R(N) because X , the support of X, has positive measure and the support of β isbounded. We first present preliminary lemmas that are useful to prove Theorem 5 and Theorem 6.

39

We define ΨN,R =(

1N

∑Ni=1

∑Jj=1 gj(xi, β

r)gj(xi, βr′))

1≤r,r′≤R(a R×R matrix).

Lemma 3. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then

min{

Pr{||g (·, βr) ||2L2,N

≤ 2||g(·, βr)||2L2, ∀r

},Pr

{||g(·, βr)||L2 ≤ 2||g(·, βr)||L2,N

, ∀r}}≥

1−R exp(−C1Nc

40/ζ

40

).

Proof. The claim follows from the union bound applied to the r-specific events and Hoeffding’s in-equality.

Lemma 3 implies that diag (ΨN,R) ≤ 2diag (ΨR) holds with probability approaching one (w.p.a.1)because ‖g (·, βr)‖2L2,N

’s are diagonal elements of ΨN,R and ‖g (·, βr)‖2L2’s are diagonal elements of ΨR.

The purpose of the following Lemma 4 is to show that ΨN,R ≥ ΨR/2 holds w.p.a.1.

Lemma 4. Let G = span{g(·, β1), . . . , g(·, βR)} be the linear space spanned by functions g(·, β1), . . . , g(·, βR).Suppose Assumptions 4 (i)-(iii) hold. Then Pr

{supµ∈G\{0}(‖µ‖

2L2/ ‖µ‖2L2,N

) > 2}≤ R2 exp

(−C2N/(ζ

40R

2)).

Proof. Let φ1, . . . , φM be an orthonormal basis of G in L2($) with M ≤ R. Also let ρ(D) denote thefollowing quantity for a symmetric matrix D: ρ(D) = sup

∑l |al|

∑l′ |al′ |

∣∣Dl,l′∣∣ , where the sup is taken

over sequences {al}Ml=1 with∑M

l=1 a2l = 1. Then, following Lemma 5.2 in Baraud (2002), we have

Pr

{sup

µ∈G\{0}

‖µ‖2L2

‖µ‖2L2,N

> c

}≤M2 exp

(−N ($0 − c−1)2

4$1 max{ρ2 (A) , ρ (B)

}) (30)

where Al,l′ =√E[||φl||2E ||φl′ ||2E ] and Bl,l′ =

∥∥∥∥φl∥∥E∥∥φl′∥∥E∥∥∞ for l, l′ = 1, . . . ,M and $0 and $1

denote the lower bound and upper bound of the density of X, respectively. We find∣∣Al,l′∣∣ ≤ ζ2

0 and∣∣Bl,l′∣∣ ≤ ζ20 . It follows that

ρ(A) ≤ ζ20 sup

∑l|al|∑

l′|al′ | = ζ2

0 sup(∑

l|al|)2≤ ζ2

0 supM∑

l|al|2 = ζ2

0M ≤ ζ20R

where (∑

l |al|)2 ≤M

∑l |al|

2 holds by the Cauchy-Schwarz inequality. Similarly we have ρ(B) ≤ ζ20R.

The conclusion follows from (30) and M ≤ R.

Now define an event E0 ={

ΨN,R −(ξmin(R)/4ζ2

0

)diag(ΨN,R) ≥ 0

}.

Lemma 5. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then,Pr{E0} ≥ 1−R exp

(−C1Nc

40/ζ

40

)−R2 exp

(−C2N/ζ

40R

2)

Proof. First, note that ΨR − (ξmin(R)/ζ20 )diag (ΨR) ≥ 0. Now let G be the linear space spanned by

g(·, β1), . . . , g(·, βR). Now note that, under the eventA ={||g(·, βr)||2L2,N

≤ 2||g(·, βr)||2L2∀r = 1, . . . , R

},

we have diag(ΨN,R) ≤ 2diag(ΨR). Also, under the event B ={

supµ∈G\{0}

(‖µ‖2L2

/ ‖µ‖2L2,N

)≤ 2},

we have ΨN,R ≥ ΨR/2. Therefore, under the intersection of the two events, A and B, we have

ΨN,R − ξmin(R)diag (ΨN,R) /(4ζ20 ) ≥ ΨR/2− 2ξmin(R)diag (ΨR) /(4ζ2

0 ) ≥ 0,

40

or event E0. Then by Lemma 3 and Lemma 4, we find 1 − Pr{E0} ≤ R · exp(−C1Nc

40/ζ

40

)+ R2 ·

exp(−C2N/ζ

40R

2)because Pr{Ec0} ≤ Pr{Ac ∪Bc} ≤ Pr{Ac}+ Pr{Bc}.

Lemma 6. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then, for given positivesequence ηN

Pr{∣∣∣∑

ie′ig(Xi, β

r)/N∣∣∣ ≤ ηN ||g(·, βr)||L2,N

for all r = 1, . . . , R}

≥ 1−R · exp(−CNη2

Nc20/ζ

20

)−R · exp

(−C1Nc

40/ζ

40

).

Proof. Hoeffding (1963)’s inequality implies that

EX

[Pr{∣∣∣∑

ie′ig(Xi, β

r)/N∣∣∣ ≥ ηN ||g(·, βr)||L2,N

, ∀r ≤ R |X1, . . . , XN

}](31)

≤ EX

[∑r

exp(−2Nη2

N ||g(·, βr)||2L2,N/4Jζ2

0

)]because E [e′ig (Xi, β

r) | X1, . . . , XN ] = 0, choice probabilities lie in [0, 1], and −√Jζ0 ≤ e′ig(Xi, β

r) ≤√Jζ0 (by the Cauchy-Schwarz inequality) uniformly.Now note under the event

{||g(·, βr)||L2 ≤ 2||g(·, βr)||L2,N

, ∀r = 1, . . . , R},

∑R

r=1exp

(−Nη2

N ‖g (·, βr)‖2L2,N/2Jζ2

0

)≤

∑R

r=1exp

(−Nη2

N ‖g (·, βr)‖2L2/8Jζ2

0

)(32)

≤∑R

r=1exp

(−Nη2

Nc20/8Jζ

20

)= R exp

(−CNη2

Nc20/ζ

20

).

From (31)-(32) and Lemma 3, the claim follows.

The following Lemma 7 decomposes the error bound into a bias term and a variance term.

Lemma 7. Suppose X has positive measure and Assumption 4(i)-(iii) hold. Then, for any N ≥ 1,R ≥ 2, and a > 1, we have for all F ∈ FR(N) ,

∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ a+ 1

a− 1‖P (xi, F )− P (xi, F0)‖2L2,N

+ Cη2NRζ

20

ξmin(R)

a2

a− 1,

where the inequality holds with probability greater than 1− pN,R,pN,R ≡ R exp

(−CNη2

Nc20/ζ

20

)+ 2R exp

(−C1Nc

40/ζ

40

)+R2 exp

(−C2N/ζ

20R

2).

Proof. Because P (xi, FN ) (i.e., FN ) is the solution of the minimization problem in (6), we have

||P (xi, FN )− yi||2L2,N≤ ||P (xi, F )− yi||2L2,N

(33)

for any F ∈ FR(N). Now note that

∥∥∥P (xi, FN )− yi∥∥∥2

L2,N

=∥∥∥P (xi, FN )− P (xi, F0) + P (xi, F0)− yi

∥∥∥2

L2,N

=∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

− 2∑

ie′i

(P (xi, FN )− P (xi, F0)

)/N + ||ei||2L2,N

, (34)

41

where we use the definition ei = yi − P (xi, F0). Similarly we obtain

∥∥∥P (xi, FN )− yi∥∥∥2

L2,N

= ‖P (xi, F )− P (xi, F0)‖2L2,N−2∑

ie′i (P (xi, F )− P (xi, F0)) /N+||ei||2L2,N

.

(35)

Subtracting (35) from (34) and by (33), we obtain∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2

∑i

e′i(P (xi, FN )− P (xi, F ))/N.

Let VN,r = 1N

∑Ni=1 e

′ig(xi, β

r). Then, 1N

∑i e′i(P (xi, FN ) − P (xi, F )) =

∑Rr=1 VN,r(θ

r − θr) by con-struction of P (xi, FN ) and P (xi, F ) and we obtain by the triangle inequality,∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2

∑r

|VN,r| · |θr − θr|. (36)

Define the event E1 =⋂Rr=1

{|VN,r | ≤ ηN ||g(·, βr)||L2,N

}. Then under {E0 ∩E1}, we have

∑R

r=1V 2N,r(θ

r − θr)2 ≤ η2N

∑R

r=1||g(·, βr)||2L2,N

(θr − θr)2 (37)

= η2N

1

N

∑N

i=1

∑R

r=1(θr − θr)2||g(xi, β

r)||2E = η2N (θ − θ)′diag(ΨN,R)(θ − θ)

≤ η2N

(ξmin(R)/4ζ2

0

)−1(θ − θ)′ΨN,R(θ − θ) = η2

N

(ξmin(R)/4ζ2

0

)−1 ||P (xi, FN )− P (xi, F )||2L2,N.

From (36) and (37), it follows that under the event {E0 ∩E1},∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2

∑R

r=1|VN,r| · |θr − θr|

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2√R

(∑R

r=1V 2N,r

(θr − θr

)2)1/2

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+ 2ηN

√R(ξmin(R)/4ζ2

0

)−1/2 ||P (xi, FN )− P (xi, F )||L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N

+2ηN√R(ξmin(R)/4ζ2

0

)−1/2(∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥L2,N

+ ‖P (xi, F )− P (xi, F0)‖L2,N

)by the Cauchy-Schwarz inequality, RCS, and the triangle inequality. Applying the inequality 2xy ≤x2a+ y2/a (any x,y, a > 0) to x = ηN

√R(ξmin(R)/4ζ2

0 )−1/2 and y = ||P (xi, FN )−P (xi, F0)||L2,Nand

to x = ηN√R(ξmin(R)/4ζ2

0 )−1/2 and y = ||P (xi, F )−P (xi, F0)||L2,N, respectively, we obtain under the

event {E0 ∩E1},∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ ‖P (xi, F )− P (xi, F0)‖2L2,N+∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

/a

+ ‖P (xi, F )− P (xi, F0)‖2L2,N/a+ 2aη2

NR(ξmin(R)/4ζ2

0

)−1

42

It follows that under the event {E0 ∩E1}, for all a > 1,∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ a+ 1

a− 1‖P (xi, F )− P (xi, F0)‖2L2,N

+ Cη2NRζ

20

ξmin(R)

a2

a− 1.

The conclusion follows from Lemma 6.

C.1 Proof of Theorem 5

We first obtain the convergence rates of the choice probability as

Lemma 8. Suppose the assumptions and the conditions in Theorem 5 hold. Then, we have∥∥∥P (xi, FN )− P (xi, F0)∥∥∥2

L2,N

≤ OP

(R(N) logR(N)

ξmin(R(N)) ·N

)+ C · inf

F∈FR(N)

‖P (xi, F )− P (xi, F0)‖2L2,N.

Then the result of Theorem 5 follows from Lemma 8 combined with Lemma 1. Lemma 8 derivesthe variance term of the asymptotic bounds and Lemma 1 derives the bias term. We prove Lemma 8below.

C.2 Proof of Lemma 8

From Lemma 7, the fastest convergence rate will be obtained when the order of ηN is as small aspossible while keeping pN,R → 0. By inspecting pN,R, we note that the optimal rate is obtained whenwe choose ηN = C

√logR(N)/N since the first term in pN,R dominates the second term in pN,R when

ηN is small enough and pN,R → 0 with this choice of ηN . The inspection of the third term in pN,Rreveals that we also require R(N) should satisfy R(N)2 logR(N)/N → 0 so that pN,R → 0. The resultof Lemma 8 follows from these requirements, Lemma 7 and∥∥∥P (xi, FN )− P (xi, F0)

∥∥∥2

L2,N

≤ OP

(R(N) logR(N)

ξmin(R(N)) ·N

)+ C · ‖P (xi, F )− P (xi, F0)‖2L2,N

because C η2NR(N)ζ20ξmin(R(N))

a2

a−1 = O(R(N) logR(N)ξmin(R(N))·N

)under our choice of ηN and because Lemma 7 holds for

any F ∈ FR(N).

C.3 Proof of Lemma 1

First we construct approximating power series with the length of L tensor products of higher orderpolynomials of βk’s in β as (ϕ1(β), . . . , ϕl(β), . . . , ϕL(β)), where ϕl(β) is the lth element in the L numberof tensor product polynomials. The tensor products are defined by the functions ϕl(β) = βl11 β

l22 · · ·β

lKK

with β = (β1, β2, . . . , βK) ∈ B =∏Kk=1[β

k, βk] and lk’s are exponents of each βk. For example, we can

let ϕ1(β) = 1, ϕ2(β) = β1, . . . , and ϕl(β) = βl11 βl22 · · ·β

lKK .

Now note that each gj (x, β) f(β) is a member of the Hölder class (s-smooth) of functions sinceit is uniformly bounded and all of its own and partial derivatives up to the order of s are also uni-formly bounded by our restriction on f in H and Assumption 4 (iv). Therefore, we can approximate

43

gj (x, β) f(β) well using power series (see Chen, 2007) and obtain the approximation error rate due toTiman (1963) as (we suppress dependence of al and ϕl on j for notational simplicity)

supβ∈B

∣∣∣∣gj (x, β) f(β)−∑L

l=1al(x)ϕl(β)

∣∣∣∣ = O(L−s/K

)(38)

for all x ∈ X . Let(βk

= bk,1 < bk,2 < · · · < bk,rk+1 = βk, k = 1, . . . ,K)be partitions of the intervals

[βk, βk], k = 1, . . . ,K, into r1, . . . , rK subintervals, respectively. Then we can define the R = r1r2 · · · rK

number of subcubes as {Cι1,...,ιK =∏Kk=1[bk,ιk , bk,ιk+1], ιk = 1, 2, . . . , rk}, which become a partition

P (B) of B. For any choice of R points

(bι1,...,ιK ∈ Cι1,...,ιK | ιk = 1, 2, . . . , rk,k = 1, . . . ,K)

(one bι1,...,ιK for each of R subcubes), now we can approximate a Riemann integral of∑L

l=1 al(x)ϕl(β)

using a quadrature method with R distinct weights

(c(ι1, . . . , ιK) ≡ c(bι1,...,ιK ) | ιk = 1, . . . , rk, k = 1, . . . ,K) such that

∫ ∑L

l=1al(x)ϕl(β)dβ =

∑L

l=1al(x)

∫ϕl(β)dβ

=∑L

l=1al(x)

∑Cι1,...,ιK∈P (B)

c(ι1, . . . , ιK)ϕk(bι1,...,ιK ) +R(δR)

where R(δR) denotes a remainder term with δR = max {diam(Cι1,...,ιK ) : Cι1,...,ιK ∈ P (B)}. Withoutloss of generality, we will pick δR = C · R−1/K . Noting that ϕl(β) is a product of polynomials inβk’s by construction, we can apply Theorem 6.1.2 (Generalized Cartesian Product Rules) of Krommerand Ueberhuber (1998) and so we can approximate multivariate integrals with products of univariateintegrals. Note that

∫ϕl(β)dβ =

∏Kk=1

∫ϕl,k(βk)dβk with ϕl(β) =

∏Kk=1 ϕl,k(βk). Therefore if we

approximate∫ϕl,k(βk)dβk using a univariate quadrature with weights {ck(1), . . . , ck(rk)}, generally we

obtain∫ϕl,k(βk)dβk =

∑rkιk=1 ck(ιk)ϕl,k(bk,ιk) +Rl,k(δR) where Rl,k(δR) denotes a possible remainder

term. Now we can make the univariate quadrature even become accurate (or exact) at least up to theorder of rk (Theorem 5.2.1 in Krommer and Ueberhuber (1998)) i.e.,

∫βpkdβk =

∑rkιk=1 ck(ιk)b

pk,ιk

withsuitable choice of ck(ιk) ≡ ck(bk,ιk) for all p ≤ rk such that Rl,k(δR) = 0.

For notational simplicity, we take rk = r1 for all k. Then with the L = (r1 + 1)K = (R1/K + 1)K

number of power series, we can include powers and cross products of βk’s at least up to the order ofr1. With the choice of L = (r1 + 1)K and ck(ιk)’s that make the univariate quadrature exact at leastup to the order of r1, we can let∫

ϕl(β)dβ =∏K

k=1

∫ϕl,k(βk)dβk =

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk) (39)

44

for l = 1, . . . , L. By adding and subtracting terms, it follows that for some F ∗ defined later

Pj (x, F )− Pj (x, F ∗) =

∫f(β)gj (x, β) dβ −

∫ ∑L

l=1al(x)ϕl(β)dβ (40)

+∑L

l=1al(x)

∫ϕl(β)dβ −

∑L

l=1al(x)

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk) (41)

+∑L

l=1al(x)

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)− Pj (x, F ∗) . (42)

We first bound (40) by the triangle inequality and (38), (40) ≤∫ ∣∣∣f(β)gj (x, β)−

∑Ll=1 al(x)ϕl(β)

∣∣∣ dβ =

O(L−s/K · vol(B)). Second note that (41) becomes zero due to (39).Next construct ϕl(br), r = 1, . . . , R such that

∑Rr=1 ϕl(b

r) =∏Kk=1

{∑rkιk=1 ck(ιk)ϕl,k(bk,ιk)

}with

R = r1 · · · rK and br = (br1, br2,, . . . , b

rK) in b ≡ {b : b = (b1,ι1 , . . . , bK,ιK ), ι1 = 1, . . . , r1, . . . , ιK = 1, . . . , rK}.

Then, for any choice of br ∈ b we can always write ϕl(br) = c1(b1,ι1) · · · cK(bK,ιK )ϕl(b) for someb = (b1,ι1 , . . . , bK,ιK ) in b. Therefore without loss of generality write

R∑r=1

ϕl(br) =

∏K

k=1

rk∑ιk=1

ck(ιk)ϕl,k(bk,ιk)

=

R∑r=1

c1(br) · · · cK(brK)ϕl(br). (43)

Define θ∗r = c1(br1) · · · cK(brK)f(br)/{∑R

r=1 c1(br1) · · · cK(brK)f(br)}

and let F ∗ be constructed usingthe weights θ∗. It follows that∣∣∣∣∑L

l=1al(x)

∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)− Pj (x, F ∗)

∣∣∣∣≤

R∑r=1

∣∣∣∣∑L

l=1al(x)ϕl(b

r)− θ∗rgj (x, br)

∣∣∣∣=

R∑r=1

∣∣∣ ∑Ll=1 al(x)c1(br1) · · · cK(brK)ϕl(b

r)− c1(br1)···cK(brK)f(br)∑Rr=1 c1(br1)···cK(brK)f(br)

gj (x, br)∣∣∣

≤R∑r=1

|c1(br1) · · · cK(brK)|

∣∣∣∣∣L∑l=1

al(x)ϕl(br)− f(br)∑R

r=1 c1(br1) · · · cK(brK)f(br)gj (x, br)

∣∣∣∣∣≤

R∑r=1

|c1(br1) · · · cK(brK)| |∑L

l=1al(x)ϕl(b

r)− f(br)gj (x, br) |

+

R∑r=1

|c1(br1) · · · cK(brK)|

∣∣∣∣∣∑R

r=1 c1(br1) · · · cK(brK)f(br)− 1∑Rr=1 c1(br1) · · · cK(brK)f(br)

∣∣∣∣∣ |f(br)gj (x, br)|

= O(R ·R−1L−s/K

)where the first inequality holds by the triangle inequality and (43) and the first equality holds by(43) and by the definition of θ∗r. In the above note that |c1(br1) · · · cK(brK)| is at most O(R−1), whichis obtained if one uses the uniform weights, i.e., ck(b1k) = . . . = ck(b

Rk ) for k = 1, . . . ,K. Second

note that∣∣∣∑L

l=1 al(x)ϕl(br)− f(br)gj (x, br)

∣∣∣ = O(L−s/K) due to the bound in (38). Third note that

45

∑Rr=1 c1(br1) · · · cK(brK)f(br) is another quadrature approximation of the integral

∫B f(β)dβ = 1. Be-

cause f(β) itself belongs to a Hölder class, the approximation error rate of this integral becomes∣∣∣∑Rr=1 c1(br1) · · · cK(brK)f(br)− 1

∣∣∣ = O(L−s/K

). To see this, note that

∣∣∣∣∣∫Bf(β)dβ −

R∑r=1

c1(br1) · · · cK(brK)f(br)

∣∣∣∣∣≤

∣∣∣∣∫Bf(β)dβ −

∫B

∑L

l=1alϕl(β)dβ

∣∣∣∣ (44)

+

∣∣∣∣∑L

l=1al

∫Bϕl(β)dβ −

∑L

l=1al∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)

∣∣∣∣ (45)

+

∣∣∣∣∣∑L

l=1al∏K

k=1

∑rk

ιk=1ck(ιk)ϕl,k(bk,ιk)−

∑L

l=1al

R∑r=1

c1(br) · · · cK(brK)ϕl(br)

∣∣∣∣∣ (46)

+

∣∣∣∣∣R∑r=1

c1(br) · · · cK(brK)∑L

l=1alϕl(b

r)−R∑r=1

c1(br1) · · · cK(brK)f(br)

∣∣∣∣∣ , (47)

where al satisfies supβ∈B

∣∣∣f(β)−∑L

l=1 alϕl(β)∣∣∣ = O

(L−s/K

)(such al’s exist since f(β) belongs to a

Hölder class). Then we find (44)= O(L−s/K · vol(B)

)by the triangle inequality, (45)= 0 by (39), and

(46)= 0 by 43. Finally note (47)≤∑R

r=1 c1(br) · · · cK(brK)∣∣∣∑L

l=1 alϕl(br)− f(br)

∣∣∣ = O(R ·R−1 · L−s/K

)by the triangle equality, |c1(br) · · · cK(brK)| is at most O(R−1), and because al satisfiessupβ∈B

∣∣∣f(β)−∑L

l=1 alϕl(β)∣∣∣ = O

(L−s/K

).

Combining these results, we bound (42) as O(L−s/K). Combining these bounds, we then conclude

|Pj (x, F )− Pj (x, F ∗)| = O(L−s/K · vol(B)

)+ 0 +O

(L−s/K

)= O

(L−s/K

)= O

(((R1/K + 1

)K)−s/K)= O

((R1/K + 1

)−s)≤ O

(R−s/K

).

The second conclusion in the lemma is trivial since the above holds for all x ∈ X and j = 1, . . . , J .

C.4 Proof of Theorem 6

First we derive the distance between θ and θ∗0 where θ∗0 (i.e. F ∗0 ) satisfies ‖P (xi, F∗0 )− P (xi, F0)‖2L2,N

=

OP(R−2s/K

). Note that such F ∗0 exists by Lemma 1. Note that with probability approaching to one,

we have 2 ‖g (·, βr)‖L2,N≥ ‖g (·, βr)‖L2

≥ c0 > 0 for all r = 1, . . . , R by Lemma 3. It follows that

46

c0

2

∑R

r=1|θr − θ∗r0 | (48)

≤∑R

r=1||g(·, βr)||L2,N

|θr − θ∗r0 | ≤√R

(∑R

r=1||g(·, βr)||2L2,N

(θr − θ∗r0 )2

)1/2

≤(4ζ2

0/ξmin(R))1/2√

R||P (xi, FN )− P (xi, F∗0 )||L2,N

≤(4ζ2

0/ξmin(R))1/2√

R(||P (xi, FN )− P (xi, F0)||L2,N

+ ||P (xi, F∗0 )− P (xi, F0)||L2,N

)=

√R(N)/ξmin(R(N)) max

{OP(√%R,N

), OP(R(N)−s/K)

},

where the second inequality holds by RCS, the third inequality holds similarly with (37), and the fourthinequality holds by the triangle inequality. Therefore, by Lemma 1, the result of Theorem 5, and theCauchy-Schwarz inequality, the conclusion follows.

Next we have∣∣∣FN (β)− F0(β)

∣∣∣ ≤ ∣∣∣FN (β)− F ∗0 (β)∣∣∣+ |F ∗0 (β)− F0(β)| by the triangle inequality. It

is not difficult to see that

supβ∈B

∣∣∣FN (β)− F ∗0 (β)∣∣∣ = sup

β∈B

∣∣∣∣∑R

r=1θr1 [βr ≤ β]−

∑R

r=1θ∗r0 1 [βr ≤ β]

∣∣∣∣= sup

β∈B

∣∣∣∣∑R

r=1(θr − θ∗r0 )1 [βr ≤ β]

∣∣∣∣ ≤∑R

r=1|θr − θ∗r0 |,

where the last inequality holds by the triangle inequality. Note that F0(β) =∫f0(b)1 [b ≤ β] db and

F ∗0 (β) becomes a quadrature approximation. We use a similar strategy with the proof of Lemma 1where we approximate f0(b) with

∑Ll=1 af0,lϕl(b) such that supβ∈B

∣∣∣f0(b)−∑L

l=1 af0,lϕl(b)∣∣∣ = O

(L−s/K

)due to Timan (1963). We find F0(β)− F ∗0 (β) = O

(R−s/K

)a.e. β ∈ B. Therefore, the conclusion fol-

lows from this and Theorem 5.

D EM Algorithms

We use uniform random variables to generate starting points for probability weights, means and vari-ances in the EM estimators. The flexible-grid EM estimator has the following steps (suppressing the tsubscript).

1. Start with initial values β0 and θ0. Calculate

K (xi, βr0) =

J∏j=1

[gj (xi, βr0)]1(yi=j) ,

where yi denotes firm i’s decision on replacement.

Then calculate h0i,r =

θr0K (xi, βr0)∑R

r=1 θr0K (xi, βr0)

.

2. Update θ0 to θ1: θr1 =

∑i h

0i,r∑

i

∑Rr=1 h

0i,r

.

47

3. Use θ1 to estimate β1 by maximizing the log likelihood

L

(θ1, β

∣∣∣∣x) =∑i

R∑r=1

ln (θr1K (xi, βr)) .

Notice that βr only appears in the term∑

i ln (θr1K (xi, βr)). We only need to solve for βr1 by

maximizing∑

i ln (θr1K (xi, βr)) separately for each r. Repeat steps 2 and 3 until convergence,

letting β0 = β1 and θ0 = θ1. We use MATLAB’s optimizer fminunc for this step.

The fixed-grid EM estimator has the following steps.

1. Start with a fixed grid BR and initial weights θ0. Calculate K (xi, βr) and h0

i,r.

2. Update θ0 to θ1: θr1 =

∑i h

0i,r∑

i

∑Rr=1 h

0i,r

. Repeat step 2 until convergence, letting θ0 = θ1.

48

References

[1] Ackerberg, D. (2009), “A New Use of Importance Sampling to Reduce ComputationalBurden in Simulation Estimation”, Quantitative Marketing and Economics, 7(4), 343–376.

[2] Aliprantis, C.D. and K.C. Border (2006), Infinite Dimensional Analysis: A Hitchhiker’sGuide, Springer.

[3] Andrews D.K.W. (2002), “Generalized Method of Moments Estimation When a Parameteris on a Boundary,” Journal of Business & Economics Statistics 20-4, 530–544.

[4] Bajari, P., J.T. Fox and S. Ryan (2007), “Linear Regression Estimation of Discrete ChoiceModels with Nonparametric Distributions of Random Coefficients”, American EconomicReview, 72, 2, 459–463.

[5] Baraud, Y. (2002), “Model Selection for Regression on a Random Design”, ESAIM Prob-ability & Statistics 7, 127-146.

[6] Beran, R. and Millar, P.W. (1994), “Minimum Distance Estimation in Random CoefficientRegression Models”, The Annals of Statistics, 22(4), 1976–1992.

[7] Biernacki, C., G, Celeux and G. Govaert (2003), “Choosing starting values for the EMalgorithm for getting the highest likelihood in multivariate Gaussian mixture models,”Computational Statistics & Data Analysis, 41, 561–575.

[8] Böhning, D. “Convergence of Simar’s Algorithm for Finding the Maximum LikelihoodEstimate of a Compound Poisson Process”, The Annals of Statistics, 10(3), 1006–1008.1982.

[9] Carrasco, M, J.P. Florens, and E. Renault (2007), “Linear Inverse Problems in StructuralEconometrics Estimation Based on Spectral Decomposition and Regularization”, Hand-book of Econometrics, V6B, Elsevier.

[10] Chen, X. (2007), “Large Sample Sieve Estimation of Semi-Nonparametric Models,” Hand-book of Econometrics V7, Elsevier.

[11] Chen, X, O. Linton, and I. van Keilegom (2003), “Estimation of Semiparametric ModelsWhen the Criterion Function is Not Smooth”, Econometrica, 71, 5, 1591-1608.

[12] Chen, X. and D. Pouzo (2012), "Estimation of Nonparametric Conditional Moment Mod-els with Possibly Nonsmooth Generalized Residuals", Econometrica, 80, 1, 277–321.

[13] Dempster, A.P., N.M. Laird and D.B. Rubin (1977), "Maximum likelihood from incom-plete data via the EM algorithm", Journal of the Royal Statistical Society, 39, 1, 1-38.

[14] Fox, J.T. and A. Gandhi (2015), “Nonparametric Identification and Estimation of RandomCoefficients in Multinomial Choice Models”, University of Michigan working paper.

[15] Fox, J.T. and A. Gandhi (2011), “Using Selection Decisions to Identify the Joint Distri-bution of Outcomes”, University of Michigan working paper.

49

[16] Fox J.T., K. Kim, S. Ryan, and P. Bajari (2011), “A Simple Estimator for the Distributionof Random Coefficients”, Quantitative Economics, 2, 381-418.

[17] Fox, J.T., K. Kim, S. Ryan, and P. Bajari (2012), “The Random Coefficients Logit ModelIs Identified”, Journal of Econometrics, 166(2), 204-212.

[18] Geweke, J. and M. Keane (2007), “Smoothly mixing regressions”, Journal of Econometrics,138, 2007.

[19] Hastie, T, R. Tibshirani, and J. Friedman (2009), The Elements of Statistical Learning:Data Mining, Inference, and Prediction, Springer Series in Statistics.

[20] Heckman, J. (1990), “Varieties of Selection Bias”, The American Economic Review, 1990,80 (2), 313–318.

[21] Heckman, J. and Singer, B. “Method for Minimizing the Impact of Distributional As-sumptions in Econometric Models for Duration Data”, Econometrica 52(2), 271-320.

[22] Heiss, F. and Winschel, V. (2008), “Likelihood approximation by numerical integrationon sparse grids”, Journal of Econometrics, 144(1), 62–80.

[23] Hoeffding, W. (1963), “Probability Inequalities for Sums of Independent Random Vari-ables”, Journal of the American Statistical Association, 58, 13-30.

[24] Horowitz, J. L. (1999), “Semiparametric Estimation of a Proportional Hazard Model withUnobserved Heterogeneity”, Econometrica, 67(5), 1001–1028.

[25] Huber, J. (1981, 2004) Robust Statistics, Wiley.

[26] Ichimura, H. and T.S. Thompson (1998), “Maximum likelihood estimation of a binarychoice model with random coefficients of unknown distribution,” Journal of Econometrics,86(2), 269–295.

[27] Jacobs, R. , Jordan, M. I., Nowlan, S., and Hinton, G (1991), “Adaptive mixtures of localexperts,” Neural Comput., 3(1), 79–87.

[28] Kamakura, W.A. (1991), “Estimating flexible distributions of ideal-points with externalanalysis of preferences”, Psychometrika, 56, 3, 419-431.

[29] Karlis, D. and E. Xekalaki (2003), “Choosing initial values for the EM algorithm for finitemixtures”, Computational Statistics & Data Analysis, 41, 577–590.

[30] Krantz, S.G. and H.R. Parks (2002), A Primer on Real Analytic Functions, second edition,Birkhauser.

[31] Krommer, A.R. and C.W. Ueberhuber (1998), Computation Integration, SIAM.

[32] Laird, N. (1978), “Nonparametric Maximum Likelihood Estimation of a Mixing Distribu-tion”, Journal of the American Statistical Association, Vol. 73, No. 364, pp. 805–811.

[33] Li, J.Q. and A.R. Barron (2000), “Mixture density estimation”, Advances in Neural Infor-mation Processing Systems, Vol. 12, pp. 279–285.

50

[34] Lindsay, B.G. (1983) “The Geometry of Mixture Likelihoods: A General Theory”, TheAnnals of Statistics, 11(1), 86–94.

[35] Matzkin, R.L. (2007) “Heterogeneous Choice”, Advances in Economics and Econometrics,Theory and Applications, Ninth World Congress of the Econometric Society. Cambridge.

[36] McLachlan, G.J. and D. Peel (2000), Finite Mixture Models. Wiley.

[37] Newey, W.K. (1997), “Convergence Rates and Asymptotic Normality for Series Estima-tors”, Journal of Econometrics 79, 147-168.

[38] Parthasarathy, K.R. (1967), Probability Measures on Metric Spaces, Academic Press.

[39] Petersen, B.E. (1983), Introduction to the Fourier Transform and Pseudo-DifferentialOperators, Pitman Publishing, Boston.

[40] Pilla, R.S. and B.G. Lindsay (2001), “Alternative EM methods for nonparametric finitemixture models”, Biometrika, 88, 2, 535–550.

[41] Pollard, D. (1984), Convergence of Statistical Processes, Springer-Verlag, New York.

[42] Quandt, R.E. and Ramsey, J.B. (1978), "Estimating Mixtures of Normal Distributionsand Switching Regressions," Journal of the American Statistical Association, 73, 364,730-738.

[43] Rust, J. (1987), “Optimal Replacement of GMC Bus Engines: An Empirical Model ofHarold Zurcher”, Econometrica, 55(5): 999–1033.

[44] Seidel, W., K. Mosler and M. Alker (2000), “A Cautionary Note on Likelihood Ratio Testsin Mixture Models”, Annals of the Institute of Statistical Mathematics, 52, 3, 418-487,

[45] Teicher, H. (1963), “Identifiability of Finite Mixtures”, Annals of Mathematical Statistics,34, 1265-1269.

[46] Timan, A.F. (1963), Theory of Approximation of Functions of a Real Variable, MacMilan,New York.

[47] Train, K. (2008), “EM Algorithms for Nonparametric Estimation of Mixing Distributions”,Journal of Choice Modeling, 1, 1, 40–69.

[48] Van der Vaart, W. and J.A. Wellner (1996), Weak Convergence and Empirical Processes,Springer Series in Statistics, New York.

[49] Varadhan, R. and C. Roland (2008), “Simple and globally convergent methods for accel-erating the convergence of any EM algorithm”, Scandinavian Journal of Statistics, 35, 2,335–353.

[50] Verbeek, J.J., N. Vlassis and B. Kröse (2003), “Efficient Greedy Learning of GaussianMixture Models”, Neural Computation, Vol. 15, pp 469–485.

[51] Zeevi, A. J. and R. Meir (1997), “Density Estimation Through Convex Combinations ofDensities: Approximation and Estimation Bounds,” Neural Networks, 10(1), 99–109.

51

N R M K RMISE max IAE mean IAE min IAE

Time

Using lsqlin (s)

Dynamic

Prog Time (s)

Mean # of

Positive Weights

100 100 1 4 0.23 0.38 0.18 0.1 0.46 382.03 7.59100 100 1 8 0.3 0.37 0.28 0.23 0.46 402.89 6.07100 200 1 4 0.21 0.35 0.16 0.07 1.55 610.64 7.91100 200 1 8 0.3 0.34 0.27 0.21 1.57 654.59 6.52500 100 1 4 0.21 0.27 0.18 0.13 0.79 1527.67 8.7

500 100 1 8 0.3 0.3 0.27 0.24 0.82 1603.61 7.13500 200 1 4 0.19 0.25 0.15 0.09 4 3123.93 9.77500 200 1 8 0.3 0.3 0.28 0.24 3.81 3276.52 7.93

1000 100 1 4 0.21 0.24 0.18 0.13 1.46 2425.95 8.941000 100 1 8 0.29 0.31 0.27 0.24 1.32 3149.32 7.35

1000 200 1 4 0.18 0.21 0.15 0.1 6.87 6078.88 10.72

1000 200 1 8 0.3 0.31 0.28 0.25 6.98 6379.8 8.47100 100 2 4 0.29 0.46 0.24 0.15 0.37 314.88 7.84

100 100 2 8 0.28 0.35 0.26 0.2 0.36 326.87 6.85100 200 2 4 0.28 0.41 0.23 0.1 1.53 613.65 8.45100 200 2 8 0.29 0.31 0.26 0.2 1.54 647.46 7.44

500 100 2 4 0.28 0.33 0.24 0.16 0.84 1517.89 8.75500 100 2 8 0.29 0.29 0.26 0.23 0.82 1606.72 9.04

500 200 2 4 0.26 0.31 0.23 0.14 3.86 3041.95 9.89500 200 2 8 0.29 0.29 0.26 0.23 3.6 3199.49 9.92

1000 100 2 4 0.28 0.3 0.25 0.21 1.39 3014.08 9.141000 100 2 8 0.29 0.29 0.26 0.24 1.51 3243.94 9.61

1000 200 2 4 0.26 0.33 0.23 0.16 6.87 6016.28 10.561000 200 2 8 0.29 0.28 0.26 0.24 6.8 6241.11 11.36100 100 5 4 0.31 0.39 0.28 0.17 0.5 398.14 7.71

100 100 5 8 0.31 0.35 0.28 0.19 0.49 416.35 7.02100 200 5 4 0.3 0.4 0.26 0.14 1.58 617.19 8.15

100 200 5 8 0.31 0.34 0.28 0.23 1.56 658.42 7.19500 100 5 4 0.32 0.36 0.29 0.23 0.74 1514.24 8.66

500 100 5 8 0.3 0.31 0.28 0.25 0.84 1625.02 9.53

500 200 5 4 0.29 0.35 0.26 0.18 3.78 3050.03 9.92

500 200 5 8 0.31 0.31 0.28 0.23 5.82 4010.44 10.771000 100 5 4 0.32 0.33 0.3 0.23 1.49 3015.38 8.331000 100 5 8 0.3 0.3 0.28 0.25 1.45 3174.38 11.171000 200 5 4 0.3 0.36 0.28 0.19 6.96 6049.14 10.491000 200 5 8 0.3 0.3 0.28 0.25 6.56 5122.4 12.9

Table 1: Fixed Grid Least Squares

N R M K RMISE max IAE mean IAE min IAE

Time in

EM Iteration

Mean # of

Positive Weights

100 100 1 4 0.26 0.4 0.22 0.1 0.49 6.81100 100 1 8 0.39 0.46 0.35 0.26 0.45 5.92100 200 1 4 0.24 0.4 0.18 0.08 0.81 7.26100 200 1 8 0.38 0.43 0.35 0.29 0.66 6.55500 100 1 4 0.25 0.31 0.23 0.14 1.06 8.49

500 100 1 8 0.38 0.39 0.35 0.32 1.23 9.04500 200 1 4 0.19 0.3 0.15 0.1 1.73 9.14500 200 1 8 0.38 0.39 0.35 0.31 1.36 10.86

1000 100 1 4 0.25 0.29 0.23 0.17 1.65 8.741000 100 1 8 0.38 0.42 0.35 0.32 1.29 10.2

1000 200 1 4 0.18 0.22 0.15 0.1 2.33 9.9

1000 200 1 8 0.38 0.38 0.35 0.32 2.3 12.89100 100 2 4 0.31 0.47 0.27 0.16 0.43 7.07

100 100 2 8 0.35 0.42 0.32 0.22 0.43 6.5100 200 2 4 0.3 0.4 0.26 0.12 0.78 7.95100 200 2 8 0.35 0.41 0.32 0.22 0.78 6.92

500 100 2 4 0.31 0.34 0.28 0.23 1.3 8.31500 100 2 8 0.35 0.37 0.33 0.27 1.17 9.74

500 200 2 4 0.28 0.33 0.25 0.15 1.54 10500 200 2 8 0.35 0.37 0.32 0.27 1.47 10.95

1000 100 2 4 0.31 0.32 0.29 0.24 1.43 8.581000 100 2 8 0.35 0.37 0.32 0.29 1.63 10.8

1000 200 2 4 0.28 0.31 0.25 0.17 2.26 10.731000 200 2 8 0.35 0.35 0.32 0.29 2.25 13.49100 100 5 4 0.35 0.4 0.32 0.23 0.55 6.84

100 100 5 8 0.38 0.43 0.34 0.25 0.51 6.46100 200 5 4 0.34 0.42 0.3 0.17 0.79 7.71

100 200 5 8 0.38 0.42 0.35 0.27 0.84 6.91500 100 5 4 0.35 0.36 0.32 0.28 1.11 8.17

500 100 5 8 0.38 0.39 0.35 0.31 1.23 9.87

500 200 5 4 0.32 0.36 0.3 0.23 1.42 9.93

500 200 5 8 0.38 0.38 0.35 0.31 1.74 11.311000 100 5 4 0.35 0.35 0.32 0.28 1.67 8.371000 100 5 8 0.38 0.37 0.35 0.32 1.58 11.11000 200 5 4 0.33 0.35 0.3 0.23 2.55 10.731000 200 5 8 0.38 0.38 0.35 0.31 2.23 13.87

*: the time to compute the dynamic programming problems is the same as in Table 1.

The two estimators use the same grid.

Table 2: Fixed Grid Likelihood*

M RMISE max IAE mean IAE min IAETotal

Time (s)*

Mean # of Positive Weights

# Dynamic

Prog Called

1 0.32 0.51 0.28 0.15 11.95 4.66 500

2 0.35 0.48 0.31 0.14 11.87 4.93 500

5 0.39 0.53 0.34 0.19 11.83 4.80 500


Time (s)*

Mean # of Positive Weights

# Dynamic

Prog Called

1 0.38 0.47 0.35 0.23 12.02 4.59 500

2 0.37 0.49 0.33 0.22 11.94 4.95 500

5 0.41 0.54 0.36 0.26 11.91 4.76 500


Time (s)

Mean # of

Positive Weights

# Dynamic

Prog Called

1 0.15 0.17 0.09 0.05 10501.86 9.94 225887

2 0.14 0.15 0.09 0.04 9488.15 9.93 206134

5 0.14 0.18 0.09 0.06 9829.64 9.95 208435

*: Combined time of dynamic programming and optimization

**: Iterate EM ten times from one starting point

Fixed Grid Least Squares

Flexible Grid Likelihood**

Fixed Grid Likelihood

Table 3: N =50, R =10, K =4, All Three Estimators

asimplenonparametricapproachtoestimatingthedistribution...

Documents