
Convergence of latent mixing measures in finite and infinite mixture models¹

XuanLong Nguyen ([email protected])
Department of Statistics, University of Michigan

Abstract

We consider Wasserstein distances for assessing the convergence of latent discrete measures, which serve as mixing distributions in hierarchical and nonparametric mixture models. We clarify the relationships between Wasserstein distances of mixing distributions and $f$-divergence functionals such as Hellinger and Kullback-Leibler distances on the space of mixture distributions, using various identifiability conditions. Convergence in Wasserstein metrics for discrete measures has a natural interpretation as the convergence of the individual atoms that provide support for the discrete measure. It is also typically stronger than the weak convergence induced by standard $f$-divergence metrics. We establish rates of convergence of posterior distributions for latent discrete measures in several mixture models, including finite mixtures of multivariate distributions and infinite mixtures based on the Dirichlet process.

1 Introduction

A notable feature in the development of hierarchical and Bayesian nonparametric models is the role of discrete probability measures, which serve as mixing distributions to combine relatively simple models into richer classes of statistical models [20, 22]. In recent years the mixture modeling methodology has been significantly extended by many authors taking the mixing measure to be random and infinite dimensional via suitable priors constructed in a nested, hierarchical and nonparametric manner. This results in rich models that can fit more complex and high dimensional data (see, e.g., [10, 28, 26, 25, 23] for several examples of such models, as well as a recent book [16]).

¹ AMS 2000 subject classification. Primary 62F15, 62G05; secondary 62G20. Key words and phrases: mixture distributions, hierarchical models, Bayesian nonparametrics, Wasserstein distance, $f$-divergence, rates of convergence, Dirichlet processes. This draft is currently under review with the Annals of Statistics (May 2012). An earlier version appeared as Technical Report 527, Department of Statistics, University of Michigan (September 2011), under the title "Wasserstein distances for discrete measures and convergence in nonparametric mixture models". We thank the AE, an anonymous reviewer, and Arash Amini for valuable comments and suggestions on earlier versions. This work was supported in part by NSF grants CCF-1115769 and OCI-1047871.

The focus of this paper is to analyze the convergence behavior of the posterior distribution of latent mixing measures as they arise in several mixture models, including finite mixtures and infinite Dirichlet process mixtures. Let $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ denote a discrete probability measure. The atoms $\theta_i$ are elements of a space $\Theta$, while the vector of probabilities $p = (p_1, \ldots, p_k)$ lies in a $(k-1)$-dimensional probability simplex. In a mixture setting, $G$ is combined with a likelihood density $f(\cdot|\theta)$ with respect to a dominating measure $\mu$ on $\mathcal{X}$ to yield the mixture density $p_G(x) = \int f(x|\theta)\,dG(\theta) = \sum_{i=1}^{k} p_i f(x|\theta_i)$. In a clustering application, the atoms $\theta_i$ represent distinct behaviors in a heterogeneous data population, while the mixing probabilities $p_i$ are the associated proportions of such behaviors. Under this interpretation, there is a need for comparing and assessing the quality of the discrete measure $\hat{G}$ estimated on the basis of available data. An important work in this direction is by Chen [5], who used the $L_1$ metric on cumulative distribution functions on the real line to study convergence rates of the mixing measure $G$. Chen's results were subsequently extended to a Bayesian estimation setting for a univariate mixture model [17]. These works were limited to univariate and finite mixture models, with $k$ bounded by a known constant, while our interest is in settings where $k$ may be unbounded and $\Theta$ is multi-dimensional or even an abstract space.

The analysis of consistency and convergence rates of posterior distributions in Bayesian estimation has seen much progress in the past decade; key recent references include [1, 13, 27, 33, 14, 34]. Specific mixture models in a Bayesian setting have also been analyzed [12, 11, 18, 15]. All of these works focus primarily on the convergence behavior of the posterior distribution of the data density $p_G$. On the other hand, there seem to be very few results concerning the convergence behavior of the latent mixing measure $G$. Notably, the analysis of convergence for mixing (smooth) densities often arose in the context of frequentist estimation for deconvolution problems, mainly within the kernel density estimation method (e.g., [4, 35, 8]).

The primary contribution of this paper is to show that Wasserstein distances provide a natural and useful metric for the analysis of convergence of latent, discrete mixing measures in mixture models, and to establish convergence rates of posterior distributions in a number of well-known Bayesian nonparametric and mixture models. Wasserstein distances originally arose in the problem of optimal transportation [31]. They have been utilized in a number of statistical contexts (e.g., [7, 21, 2, 6]). For discrete probability measures, they can be obtained by a minimum matching (or moving) procedure between the sets of atoms that provide support for the measures under comparison, and consequently are simple to compute. Suppose that $\Theta$ is equipped with a metric $\rho$. Let $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$. Then the $L_r$ Wasserstein distance metric on the space of discrete probability measures with support in $\Theta$, namely $\bar{\mathcal{G}}(\Theta)$, is:

$$d_\rho(G, G') = \left[ \inf_q \sum_{i,j} q_{ij}\, \rho^r(\theta_i, \theta'_j) \right]^{1/r},$$

where the infimum is taken over all joint probability distributions $q = (q_{ij})$ on $[1, \ldots, k] \times [1, \ldots, k']$ such that $\sum_j q_{ij} = p_i$ and $\sum_i q_{ij} = p'_j$.

As is clear from this definition, the Wasserstein distances inherit directly the metric of the space of atomic support $\Theta$, suggesting that they can be useful for assessing estimation procedures for discrete measures in hierarchical models. It is worth noting that if $(G_n)_{n \ge 1}$ is a sequence of discrete probability measures with $k$ distinct atoms and $G_n$ tends to some discrete measure $G_0$ in the $d_\rho$ metric, then the ordered set of atoms of $G_n$ must converge to the atoms of $G_0$ in $\rho$, after some permutation of atom labels. Thus, in the clustering application illustrated above, convergence of the mixing measure $G$ may be interpreted as convergence of the distinct typical behaviors $\theta_i$ that characterize the heterogeneous data population. A hint of the relevance of Wasserstein distances can be drawn from the observation that the $L_1$ distance between the CDFs of univariate random variables, as studied by Chen [5], is in fact a special case of the $L_1$ Wasserstein metric when $\Theta = \mathbb{R}$.

The plan for the paper is as follows. Section 2 explores the relationship between Wasserstein distances for mixing measures and well-known divergence functionals for mixture densities in a mixture model. We produce a simple lemma which upper-bounds $f$-divergences between mixture densities by certain Wasserstein distances between mixing measures. This implies that the $d_\rho$ topology can be stronger than those induced by divergences between mixture densities. Next, we consider various identifiability conditions under which convergence of mixture densities entails convergence of mixing measures in the Wasserstein metric. We present two theorems which provide upper bounds on $d_\rho(G, G')$ in terms of divergences between $p_G$ and $p_{G'}$. Theorem 1 is applicable to mixing measures with a bounded number of support atoms, generalizing a result from [5]. Theorem 2 is applicable to mixing measures with an unbounded number of support points, but is restricted to convolution mixture models.

Section 3 focuses on the convergence of posterior distributions of latent mixing measures in a Bayesian nonparametric setting. Here the mixing measure $G$ is endowed with a prior distribution $\Pi$. Assuming an $n$-sample $X_1, \ldots, X_n$ generated according to $p_{G_0}$, we study conditions under which the posterior distribution of $G$, namely $\Pi(\cdot|X_1, \ldots, X_n)$, contracts to the "truth" $G_0$ under the $d_\rho$ metric, and provide the contraction rates. In Theorems 3 and 4 of Section 3, we establish convergence rates for the posterior distribution of $G$ in terms of the $d_\rho$ metric. These results are proved using the standard approach of Ghosal, Ghosh and van der Vaart [13]. Our convergence theorems have several notable features. They rely on separate conditions for the prior $\Pi$ and the likelihood function $f$, which are typically simpler to verify than conditions formulated in terms of mixture densities. The claim of convergence in Wasserstein metrics is typically stronger than the weak convergence induced by the Hellinger metric in the existing work mentioned above.

In Section 4, posterior consistency and convergence rates of latent mixing measures are derived for a number of well-known mixture models in the literature, including finite mixtures of multivariate distributions and infinite mixtures based on Dirichlet processes. For finite mixtures with a bounded number of support atoms in $\mathbb{R}^d$, the posterior convergence rate for mixing measures is the minimax optimal $n^{-1/4}$ under suitable identifiability conditions. For Dirichlet process mixtures defined on $\mathbb{R}^d$, specific rates are established under smoothness conditions on the likelihood density function $f$. In particular, for ordinary smooth likelihood densities with smoothness $\beta$ (e.g., Laplace), the rate achieved is $(\log n/n)^{\gamma}$ for any $\gamma < \frac{2}{(d+2)(4+(2\beta+1)d)}$. For supersmooth likelihood densities with smoothness $\beta$ (e.g., normal), the rate achieved is $(\log n)^{-1/\beta}$.

Notations. For ease of notation, we use $f_i$ in place of $f(\cdot|\theta_i)$ and $f'_j$ in place of $f(\cdot|\theta'_j)$ for likelihood density functions. Divergences (distances) studied in the paper include the total variation distance, $V(p_G, p_{G'}) := \frac{1}{2}\int |p_G(x) - p_{G'}(x)|\,d\mu(x)$; the Hellinger distance, $h^2(p_G, p_{G'}) := \frac{1}{2}\int (\sqrt{p_G(x)} - \sqrt{p_{G'}(x)})^2\,d\mu(x)$; and the Kullback-Leibler divergence, $K(p_G, p_{G'}) := \int p_G(x)\log(p_G(x)/p_{G'}(x))\,d\mu(x)$. These divergences are related by $V^2/2 \le h^2 \le V$ and $h^2 \le K/2$. $N(\epsilon, \Theta, \rho)$ denotes the covering number of the metric space $(\Theta, \rho)$, i.e., the minimum number of $\epsilon$-balls needed to cover the entire space $\Theta$. $D(\epsilon, \Theta, \rho)$ denotes the packing number of $(\Theta, \rho)$, i.e., the maximum number of points that are mutually separated by at least $\epsilon$ in distance. They are related by $N(\epsilon, \Theta, \rho) \le D(\epsilon, \Theta, \rho) \le N(\epsilon/2, \Theta, \rho)$. $\mathrm{Diam}(\Theta)$ denotes the diameter of $\Theta$.

2 Wasserstein distances for mixing measures

2.1 Definition and a basic inequality

Let $(\Theta, \rho)$ be a space equipped with a non-negative distance function $\rho: \Theta \times \Theta \to \mathbb{R}_+$, i.e., a function that satisfies $\rho(\theta_1, \theta_2) = 0$ if and only if $\theta_1 = \theta_2$. If, in addition, $\rho$ is symmetric ($\rho(\theta_1, \theta_2) = \rho(\theta_2, \theta_1)$) and satisfies the triangle inequality, then it is a proper metric. A discrete probability measure $G$ on a measure space equipped with the Borel sigma algebra takes the form $G = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ for some $k \in \mathbb{N} \cup \{+\infty\}$, where $p = (p_1, p_2, \ldots, p_k)$ denotes the vector of proportions, while $\theta = (\theta_1, \ldots, \theta_k)$ are the associated atoms in $\Theta$. The vector $p$ has to satisfy $0 \le p_i \le 1$ and $\sum_{i=1}^{k} p_i = 1$. (With a slight abuse of notation, we write $k = \infty$ when $G = \sum_{i=1}^{\infty} p_i \delta_{\theta_i}$ has countably infinite support points, represented by the infinite sequence of atoms $\theta = (\theta_1, \ldots)$ and the associated sequence of probability masses $p$.) Likewise, $G' = \sum_{j=1}^{k'} p'_j \delta_{\theta'_j}$ is another discrete probability measure that has at most $k'$ distinct atoms. Let $\mathcal{G}_k(\Theta)$ denote the space of all discrete probability measures with at most $k$ atoms, and let $\mathcal{G}(\Theta) = \cup_{k \in \mathbb{N}_+} \mathcal{G}_k(\Theta)$ be the set of all discrete measures with finite support. Finally, $\bar{\mathcal{G}}(\Theta)$ denotes the space of all discrete measures (including those with countably infinite support).

Let $q = (q_{ij})_{i \le k;\, j \le k'} \in [0,1]^{k \times k'}$ denote a joint probability distribution on $\mathbb{N}_+ \times \mathbb{N}_+$ that satisfies the marginal constraints $\sum_{i=1}^{k} q_{ij} = p'_j$ and $\sum_{j=1}^{k'} q_{ij} = p_i$ for any $i = 1, \ldots, k$ and $j = 1, \ldots, k'$. We also call $q$ a coupling of $p$ and $p'$, and let $\mathcal{Q}(p, p')$ denote the space of all such couplings. We start with the $L_1$ Wasserstein distance:

Definition 1. Let $\rho$ be a distance function on $\Theta$. The Wasserstein distance between two discrete measures $G(p, \theta)$ and $G'(p', \theta')$ is:

$$d_\rho(G, G') = \inf_{q \in \mathcal{Q}(p, p')} \sum_{i,j} q_{ij}\, \rho(\theta_i, \theta'_j). \qquad (1)$$
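Since the measures are discrete with finitely many atoms, the infimum in Eq. (1) is a finite linear program over couplings $q \in \mathcal{Q}(p, p')$. The following minimal sketch (not part of the paper; the function name, solver choice and example values are illustrative) computes $d_\rho$ with $\rho$ the Euclidean metric, using scipy's LP solver; setting $r = 2$ and taking a square root yields $W_2$.

# A minimal sketch (assumptions: Theta is a subset of R^d, rho = Euclidean):
# solve the coupling linear program of Definition 1 with scipy.
import numpy as np
from scipy.optimize import linprog

def wasserstein_discrete(p, atoms, p2, atoms2, r=1):
    """L_r Wasserstein distance between G = sum_i p[i] delta_{atoms[i]}
    and G' = sum_j p2[j] delta_{atoms2[j]}."""
    k, k2 = len(p), len(p2)
    # cost[i, j] = rho(theta_i, theta'_j)^r
    cost = np.linalg.norm(atoms[:, None, :] - atoms2[None, :, :], axis=2) ** r
    A_eq = []
    for i in range(k):               # row marginals: sum_j q_ij = p_i
        row = np.zeros((k, k2)); row[i, :] = 1; A_eq.append(row.ravel())
    for j in range(k2):              # column marginals: sum_i q_ij = p'_j
        col = np.zeros((k, k2)); col[:, j] = 1; A_eq.append(col.ravel())
    res = linprog(cost.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([p, p2]),
                  bounds=[(0, None)] * (k * k2), method="highs")
    return res.fun ** (1.0 / r)

# Example: two mixing measures with atoms in R^2.
G_p, G_atoms = np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 0.0]])
H_p, H_atoms = np.array([0.3, 0.7]), np.array([[0.1, 0.0], [1.0, 0.2]])
print(wasserstein_discrete(G_p, G_atoms, H_p, H_atoms, r=2))  # W_2(G, H)

For larger supports, dedicated optimal-transport solvers are faster, but the generic LP formulation mirrors Definition 1 most directly.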

Our main focus is on the $L_1$ Wasserstein distance $d_\rho$ and the $L_2$ Wasserstein distance. The latter corresponds to the square root of $d_{\rho^2}$ in our definition, where $\rho(\theta_i, \theta'_j)$ is replaced by $\rho^2(\theta_i, \theta'_j)$. When $\Theta$ is a metric space (e.g., $\mathbb{R}^d$) and $\rho$ is taken to be its metric, we revert to the more standard notations $W_1(G, G')$ for $d_\rho(G, G')$ and $W_2^2(G, G')$ for $d_{\rho^2}(G, G')$. However, $d_\rho$ will be employed when $\rho$ may be a general or non-standard distance function or metric.

From here on, the discrete measure $G \in \bar{\mathcal{G}}(\Theta)$ plays the role of the mixing distribution in a mixture model. Let $f(x|\theta)$ denote the density (with respect to a dominating measure $\mu$) of a random variable $X$ taking values in $\mathcal{X}$, given parameter $\theta \in \Theta$. For ease of notation, we also use $f_i(x)$ for $f(x|\theta_i)$. Combining $G$ with the likelihood function $f$ yields a mixture distribution for $X$ with the following density:

$$p_G(x) = \int f(x|\theta)\,dG(\theta) = \sum_{i=1}^{k} p_i f_i(x).$$

A central theme in this paper is to explore the relationship between Wasserstein distances of mixing measures $G, G'$, e.g., $d_\rho(G, G')$, and divergences of mixture densities $p_G, p_{G'}$. Divergences that play important roles in this paper are the total variation distance, the Hellinger distance, and the Kullback-Leibler divergence. All of these are in fact instances of a broader class of divergences known as the $f$-divergences (Csiszár, 1966; Ali & Silvey, 1967):

Definition 2. Let $\phi: \mathbb{R} \to \mathbb{R}$ denote a convex function. An $f$-divergence (or Ali-Silvey distance) between two probability densities $f_i$ and $f'_j$ is defined as $\rho_\phi(f_i, f'_j) = \int \phi(f'_j/f_i) f_i\,d\mu$. Likewise, the $f$-divergence between $p_G$ and $p_{G'}$ is $\rho_\phi(p_G, p_{G'}) = \int \phi(p_{G'}/p_G)\, p_G\,d\mu$.

$f$-divergences can be used as a distance function on $\Theta$. When $\rho$ is taken to be an $f$-divergence, $\rho(\theta_i, \theta'_j) := \rho_\phi(f_i, f'_j)$ for a convex function $\phi$, we call the corresponding Wasserstein distance a composite Wasserstein distance:

$$d_{\rho_\phi}(G, G') := \inf_{q \in \mathcal{Q}(p, p')} \sum_{i,j} q_{ij}\, \rho_\phi(f_i, f'_j).$$

For $\phi(u) = \frac{1}{2}(\sqrt{u} - 1)^2$ we obtain the squared Hellinger distance ($\rho_{h^2} \equiv h^2$), which induces the composite Wasserstein distance $d_{\rho_{h^2}}$. For $\phi(u) = \frac{1}{2}|u - 1|$ we obtain the variational distance ($\rho_V \equiv V$), which induces $d_{\rho_V}$. For $\phi(u) = -\log u$, we obtain the Kullback-Leibler divergence ($\rho_K \equiv K$), which induces $d_{\rho_K}$.

Lemma 1. Let $G, G' \in \bar{\mathcal{G}}(\Theta)$ be such that both $\rho_\phi(p_G, p_{G'})$ and $d_{\rho_\phi}(G, G')$ are finite for some convex function $\phi$. Then $\rho_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$.

This lemma highlights a simple direction in the aforementioned relationship: any $f$-divergence between mixture distributions $p_G$ and $p_{G'}$ is dominated by the corresponding composite Wasserstein distance between the mixing measures $G$ and $G'$. As will be evident in the sequel, this basic inequality is handy in enabling us to obtain upper bounds on the power of tests. It also proves useful for establishing lower bounds on small Kullback-Leibler ball probabilities in the space of mixture densities $p_G$ in terms of small ball probabilities in the metric space $(\Theta, \rho)$. The latter quantities are typically easier to estimate than the former.

Example 1. Suppose that $\Theta = \mathbb{R}^d$, $\rho$ is the Euclidean metric, and $f(x|\theta)$ is the multivariate normal density $N(\theta, I_{d \times d})$ with mean $\theta$ and identity covariance matrix. Then $h^2(f_i, f'_j) = 1 - \exp(-\frac{1}{8}\|\theta_i - \theta'_j\|^2) \le \frac{1}{8}\|\theta_i - \theta'_j\|^2 = \rho^2(\theta_i, \theta'_j)/8$. So $d_{\rho_{h^2}}(G, G') \le d_{\rho^2}(G, G')/8$. Lemma 1 then entails that $h^2(p_G, p_{G'}) \le d_{\rho^2}(G, G')/8 = W_2^2(G, G')/8$.

Similarly, for the Kullback-Leibler divergence, since $K(f_i, f'_j) = \frac{1}{2}\|\theta_i - \theta'_j\|^2$, by Lemma 1, $K(p_G, p_{G'}) \le d_{\rho_K}(G, G') = \frac{1}{2} d_{\rho^2}(G, G') = W_2^2(G, G')/2$.

For another example, if $f(x|\theta)$ is a Gamma density with location parameter $\theta$, and $\Theta$ is a compact subset of $\mathbb{R}$ that is bounded away from 0, then $K(f_i, f'_j) = O(|\theta_i - \theta'_j|)$. This entails that $K(p_G, p_{G'}) \le d_{\rho_K}(G, G') \le O(W_1(G, G'))$.
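The Gaussian bound of Example 1 is easy to probe numerically. The sketch below (illustrative, not from the paper) approximates $h^2(p_G, p_{G'})$ by grid integration for two unit-variance Gaussian location mixtures on $\mathbb{R}$ and compares it with $W_2^2(G, G')/8$, reusing the wasserstein_discrete helper from the earlier sketch.

import numpy as np
from scipy.stats import norm

def mixture_pdf(x, p, atoms):
    # p_G(x) = sum_i p_i N(x | theta_i, 1)
    return sum(pi * norm.pdf(x, loc=th) for pi, th in zip(p, atoms))

x = np.linspace(-20, 20, 20001)
p1, atoms1 = np.array([0.4, 0.6]), np.array([-1.0, 2.0])
p2, atoms2 = np.array([0.5, 0.5]), np.array([-0.8, 2.3])

pG, pH = mixture_pdf(x, p1, atoms1), mixture_pdf(x, p2, atoms2)
h2 = 0.5 * np.trapz((np.sqrt(pG) - np.sqrt(pH)) ** 2, x)  # squared Hellinger

W2 = wasserstein_discrete(p1, atoms1[:, None], p2, atoms2[:, None], r=2)
print(h2, W2 ** 2 / 8)  # the first number should not exceed the second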

2.2 Wasserstein metric identifiability in finite mixture models

Lemma 1 shows that for many choices of $\rho$, $d_\rho$ yields a stronger topology on $\bar{\mathcal{G}}(\Theta)$ than the topology induced by $f$-divergences on the space of mixture distributions $p_G$. In other words, convergence of $p_G$ may not imply convergence of $G$ in the $d_\rho$ metric. To ensure this property, additional conditions are needed on the space of discrete measures $\bar{\mathcal{G}}(\Theta)$, along with identifiability conditions for the family of likelihood functions $\{f(\cdot|\theta), \theta \in \Theta\}$.

The classical definition of Teicher [30] specifies the family $\{f(\cdot|\theta), \theta \in \Theta\}$ to be identifiable if for any $G, G' \in \mathcal{G}(\Theta)$, $\|p_G - p_{G'}\|_\infty = 0$ implies that $G = G'$. We need a slightly stronger version, allowing for the inclusion of discrete measures with infinite support:

Definition 3. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is finitely identifiable if for any $G \in \mathcal{G}(\Theta)$ and $G' \in \bar{\mathcal{G}}(\Theta)$, $|p_G(x) - p_{G'}(x)| = 0$ for almost all $x \in \mathcal{X}$ implies that $G = G'$.

To obtain convergence rates, we also need the notion of strong identifiability of [5], herein adapted to a multivariate setting.

Definition 4. Assume that $\Theta \subseteq \mathbb{R}^d$ and $\rho$ is the Euclidean metric. The family $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable if $f(x|\theta)$ is twice differentiable in $\theta$ and, for any finite $k$ and $k$ different $\theta_1, \ldots, \theta_k$, the equality

$$\operatorname{ess\,sup}_{x \in \mathcal{X}} \left| \sum_{i=1}^{k} \alpha_i f(x|\theta_i) + \beta_i^T Df(x|\theta_i) + \gamma_i^T D^2 f(x|\theta_i)\gamma_i \right| = 0 \qquad (2)$$

implies that $\alpha_i = 0$ and $\beta_i = \gamma_i = 0 \in \mathbb{R}^d$ for $i = 1, \ldots, k$. Here, for each $x$, $Df(x|\theta_i)$ and $D^2 f(x|\theta_i)$ denote the gradient and the Hessian at $\theta_i$ of the function $f(x|\cdot)$, respectively.

Finite identifiability is satisfied by the family of Gaussian distributions for both mean and variance parameters [29]; see also Theorem 1 of [18]. Chen [5] identified a broad class of families, including the Gaussian family, for which the strong identifiability condition holds.

Define $\psi(G, G') = \sup_x |p_G(x) - p_{G'}(x)|/W_2^2(G, G')$ if $G \ne G'$, and $\infty$ otherwise. Also define $\psi_1(G, G') = V(p_G, p_{G'})/W_2^2(G, G')$ if $G \ne G'$, and $\infty$ otherwise. The notion of strong identifiability is useful via the following key result, which generalizes Chen's result to $\Theta$ of arbitrary dimension.

Theorem 1 (Strong identifiability). Suppose that $\Theta$ is a compact subset of $\mathbb{R}^d$, the family $\{f(\cdot|\theta), \theta \in \Theta\}$ is strongly identifiable, and for all $x \in \mathcal{X}$ the Hessian matrix $D^2 f(x|\theta)$ satisfies a uniform Lipschitz condition

$$|\gamma^T (D^2 f(x|\theta_1) - D^2 f(x|\theta_2))\gamma| \le C\|\theta_1 - \theta_2\|^\delta \|\gamma\|^2 \qquad (3)$$

for all $x, \theta_1, \theta_2$ and some fixed $C$ and $\delta > 0$. Then, for fixed $G_0 \in \mathcal{G}_k(\Theta)$, where $k < \infty$:

$$\lim_{\epsilon \to 0} \inf_{G, G' \in \mathcal{G}_k(\Theta)} \left\{ \psi(G, G') : W_2(G_0, G) \vee W_2(G_0, G') \le \epsilon \right\} > 0. \qquad (4)$$

The assertion also holds with $\psi$ replaced by $\psi_1$.

Remark. Suppose that $G_0$ has exactly $k$ distinct support points in $\Theta$ (i.e., $G_0 = \sum_{i=1}^{k} p_i \delta_{\theta_i}$ where $p_i > 0$ for all $i = 1, \ldots, k$). Then an examination of the proof reveals that the requirement that $\Theta$ be compact is not needed. Indeed, if there is a sequence $G_n \in \mathcal{G}_k(\Theta)$ such that $d_\rho(G_0, G_n) \to 0$, then it is simple to show that there is a subsequence of $G_n$ whose elements also have $k$ distinct atoms, which converge in the $\rho$ metric to the set of $k$ atoms of $G_0$ (up to a permutation of the labels). The proof of the theorem then proceeds as before.

For the rest of this paper, by strong identifiability we always mean the conditions specified in Theorem 1, so that Eq. (4) can be deduced. In practice this means that the conditions specified by Eq. (2) and Eq. (3) are assumed, while the compactness of $\Theta$ may sometimes be required.

2.3 Wasserstein metric identifiability in infinite mixture models

Next, we state a counterpart of Theorem 1 for $G, G' \in \bar{\mathcal{G}}(\Theta)$, i.e., mixing measures with a potentially unbounded number of support points. We restrict our attention to convolution mixture models on $\mathbb{R}^d$. That is, the likelihood density function $f(x|\theta)$, with respect to Lebesgue measure, takes the form $f(x - \theta)$ for some multivariate density function $f$ on $\mathbb{R}^d$. Thus, $p_G(x) = G * f(x) = \sum_{i=1}^{k} p_i f(x - \theta_i)$ and $p_{G'}(x) = G' * f(x) = \sum_{j=1}^{k'} p'_j f(x - \theta'_j)$. As before, we need a compactness condition on $\Theta$. Additional key assumptions concern the smoothness of the density function $f$, characterized in terms of the tail behavior of its Fourier transform $\tilde{f}(\omega) = \int_{\mathbb{R}^d} e^{-i\langle \omega, x \rangle} f(x)\,dx$. We consider both ordinary smooth densities (e.g., Laplace and Gamma) and supersmooth densities (e.g., normal). The following result does not require that $G, G'$ be discrete.

Theorem 2. Suppose that $G, G'$ are probability measures with full support on a compact set $\Theta \subset \mathbb{R}^d$, and that $f$ is a density function on $\mathbb{R}^d$ that is symmetric (around 0), i.e., $\int_A f\,dx = \int_{-A} f\,dx$ for any Borel set $A \subset \mathbb{R}^d$. Moreover, assume that $\tilde{f}(\omega) \ne 0$ for all $\omega \in \mathbb{R}^d$.

(1) Ordinary smooth likelihood. Suppose that $|\tilde{f}(\omega) \prod_{j=1}^{d} |\omega_j|^\beta| \ge d_0$ as $\omega_j \to \infty$ ($j = 1, \ldots, d$) for some positive constants $d_0$ and $\beta$. Then for any $m < 4/(4 + (2\beta+1)d)$ there is a constant $C(d, \beta, m)$, depending only on $d$, $\beta$ and $m$, such that

$$W_2^2(G, G') \le C(d, \beta, m)\, V(p_G, p_{G'})^m$$

as $V(p_G, p_{G'}) \to 0$.

(2) Supersmooth likelihood. Suppose that $|\tilde{f}(\omega) \prod_{j=1}^{d} \exp(|\omega_j|^\beta/\gamma)| \ge d_0$ as $\omega_j \to \infty$ ($j = 1, \ldots, d$) for some positive constants $\beta, \gamma, d_0$. Then there is a constant $C(d, \beta)$, depending only on $d$ and $\beta$, such that

$$W_2^2(G, G') \le C(d, \beta)\, (-\log V(p_G, p_{G'}))^{-2/\beta}$$

as $V(p_G, p_{G'}) \to 0$.

Example 2. For the standard normal density on $\mathbb{R}^d$, $\tilde{f}(\omega) = \prod_{j=1}^{d} \exp(-\omega_j^2/2)$, and we obtain that $d_\rho^2(G, G') \lesssim (-\log V(p_G, p_{G'}))^{-1}$ as $d_\rho(G, G') \to 0$ (so that $V(p_G, p_{G'}) \to 0$, by Lemma 1). For a Laplace density on $\mathbb{R}$, with $\tilde{f}(\omega) = \frac{1}{1+\omega^2}$, we have $d_\rho^2(G, G') \lesssim V(p_G, p_{G'})^m$ for any $m < 4/9$, as $d_\rho(G, G') \to 0$.

3 Convergence of posterior distributions of mixing measures

We turn to a study of convergence of discrete mixing measures in a Bayesian setting. Let $X_1, \ldots, X_n$ be an iid sample from the mixture density $p_G(x) = \int f(x|\theta)\,dG(\theta)$, where $f$ is known, while $G = G_0$ for some unknown mixing measure in $\mathcal{G}_k(\Theta)$. The true number of support points of $G_0$ may be unknown. In the Bayesian estimation framework, $G$ is endowed with a prior distribution $\Pi$ on a suitable measure space of discrete probability measures in $\bar{\mathcal{G}}(\Theta)$. The posterior distribution of $G$ is given by, for any measurable set $B$:

$$\Pi(B|X_1, \ldots, X_n) = \int_B \prod_{i=1}^{n} p_G(X_i)\,d\Pi(G) \Big/ \int \prod_{i=1}^{n} p_G(X_i)\,d\Pi(G).$$
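For intuition, the Bayes rule above can be made concrete with a toy prior supported on finitely many candidate mixing measures. The sketch below is illustrative only — the candidates and the unit-variance Gaussian kernel are assumptions, and real priors on $G$ are continuous — but it shows exactly how the numerator and denominator above combine.

import numpy as np
from scipy.stats import norm

def log_mixture_lik(X, p, atoms):
    # sum_i log p_G(X_i), with p_G(x) = sum_j p_j N(x | theta_j, 1)
    dens = norm.pdf(X[:, None], loc=atoms[None, :]) @ p
    return np.log(dens).sum()

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1, 1, 60), rng.normal(2, 1, 40)])  # data from p_{G_0}

# Toy prior: three candidate mixing measures G, each with prior mass 1/3.
candidates = [
    (np.array([0.6, 0.4]), np.array([-1.0, 2.0])),
    (np.array([0.5, 0.5]), np.array([0.0, 1.0])),
    (np.array([1.0]),      np.array([0.5])),
]
logw = np.array([log_mixture_lik(X, p, a) for p, a in candidates])
post = np.exp(logw - logw.max()); post /= post.sum()
print(post)  # Pi(G = candidate | X_1, ..., X_n) for each candidate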

We shall study conditions under which the posterior distribution is consistent, i.e., it concentrates on arbitrarily small $W_2$ neighborhoods of $G_0$, and establish rates of convergence. We follow the general framework of Ghosal, Ghosh and van der Vaart [13], who analyzed the convergence behavior of posterior distributions in terms of $f$-divergences such as the Hellinger and variational distances on the mixture densities of the data. In the following we formulate two convergence theorems for the mixture model setting (which can be viewed as counterparts of Theorems 2.1 and 2.4 of [13]). A notable feature of our theorems is that the conditions (e.g., entropy and prior concentration) are stated directly in terms of the Wasserstein metric, as opposed to $f$-divergences on the mixture densities. They may typically be separated into independent conditions for the prior on $G$ and for the likelihood family, and are simpler to verify for mixture models. In addition, the convergence of the posterior distribution of mixing measures is established in terms of the $L_2$ Wasserstein distance metric.

The following notion plays a central role in our general results.

Definition 5. Fix $k < \infty$ and $G_0 \in \mathcal{G}_k(\Theta)$. Let $\mathcal{G}$ be a subset of $\bar{\mathcal{G}}(\Theta)$. Define the Hellinger information of the $W_2$ metric for the subset $\mathcal{G}$ as the real-valued function $C_k(\mathcal{G}, \cdot): \mathbb{R} \to \mathbb{R}$:

$$C_k(\mathcal{G}, r) = \inf_{G \in \mathcal{G}:\, W_2(G_0, G) \ge r/2} h^2(p_{G_0}, p_G). \qquad (5)$$

It is obvious that $C_k$ is a non-negative and non-decreasing function. The following characterizations of $C_k$ are simple consequences of Theorems 1 and 2:

Proposition 1. (a) Suppose that $\mathcal{G}$ and $\mathcal{G}_k(\Theta)$ are both compact in the Wasserstein topology, and the family of likelihood functions is finitely identifiable. Then $C_k(\mathcal{G}, r) > 0$ for any $r > 0$.

(b) Suppose that $\Theta \subset \mathbb{R}^d$ is compact, and the family of likelihood functions is strongly identifiable as specified in Theorem 1. Then, for each $k$, there is a constant $c(k, G_0) > 0$ such that $C_k(\mathcal{G}_k(\Theta), r) \ge c(k, G_0) r^4$ for all $r > 0$.

(c) Suppose that $\Theta \subset \mathbb{R}^d$ is compact, and the family of likelihood functions is ordinary smooth with parameter $\beta$, as specified in Theorem 2. Then, for any $\epsilon > 0$, there is a constant $c(d, \beta)$ such that $C_k(\bar{\mathcal{G}}(\Theta), r) \ge c(d, \beta) r^{4+(2\beta+1)d+\epsilon}$ for any $r > 0$ and any $k > 0$. For a supersmooth likelihood family, we have $C_k(\bar{\mathcal{G}}(\Theta), r) \ge \exp[-c(d, \beta) r^{-\beta}]$ for any $r > 0$ and any $k > 0$.

A main ingredient in the analysis of convergence of posterior distributions is proving the existence of tests for subsets of parameters of interest. A test $\varphi_n$ is a measurable indicator function of the iid sample $X_1, \ldots, X_n$. For a fixed pair of discrete measures $(G_0, G_1) \in \mathcal{G}_k(\Theta) \times \mathcal{G}$, where $\mathcal{G}$ is a given subset of $\bar{\mathcal{G}}(\Theta)$, consider tests for discriminating $G_0$ against a closed Wasserstein ball centered at $G_1$. Write

$$B_W(G_1, r) = \{G \in \bar{\mathcal{G}}(\Theta) : W_2(G_1, G) \le r\}.$$

The following lemma highlights the role of the Hellinger information:

Lemma 2. Suppose that $(\Theta, \rho)$ is a metric space. Fix $k < \infty$ and $G_0 \in \mathcal{G}_k(\Theta)$. Let $\mathcal{G}$ be a subset of mixing measures such that $\mathcal{G}_k(\Theta) \subseteq \mathcal{G} \subseteq \bar{\mathcal{G}}(\Theta)$. Given $G_1 \in \mathcal{G}$, let $r = W_2(G_0, G_1)$. Suppose that either one of the following two sets of conditions holds:

(I) $\mathcal{G}$ is a convex set, in which case set $M(\mathcal{G}, G_1, r) = 1$.

(II) $\mathcal{G}$ is non-convex, while $\Theta$ is a totally bounded and bounded set. In addition, for some constants $C_1 > 0$ and $\alpha \ge 1$, $h(f_i, f'_j) \le C_1 \rho^\alpha(\theta_i, \theta'_j)$ for any likelihood functions $f_i, f'_j$ in the family. In this case, define

$$M(\mathcal{G}, G_1, r) = D\left( \frac{C_k(\mathcal{G}, r)^{1/2}}{\sqrt{2C_1}\,\mathrm{Diam}(\Theta)^{\alpha-1}},\; \mathcal{G} \cap B_W(G_1, r/2),\; W_2 \right). \qquad (6)$$

Then there exist tests $\{\varphi_n\}$ with the following properties:

$$P_{G_0}\varphi_n \le M(\mathcal{G}, G_1, r)\exp[-nC_k(\mathcal{G}, r)/8], \qquad (7)$$

$$\sup_{G \in \mathcal{G} \cap B_W(G_1, r/2)} P_G(1 - \varphi_n) \le \exp[-nC_k(\mathcal{G}, r)/8]. \qquad (8)$$

Here, $P_G$ denotes the expectation under the mixture distribution with density $p_G$.

Remark. The set of conditions (II) is needed when $\mathcal{G}$ is not convex, an example of which is $\mathcal{G} = \mathcal{G}_k(\Theta)$, the space of measures with at most $k$ support points in $\Theta$. It is interesting to note that the loss in test power due to the lack of convexity is captured by the local entropy term $\log M(\mathcal{G}, G_1, r)$. This quantity is defined in terms of packing by small $W_2$ balls whose radii are specified by the Hellinger information. We note briefly that this quantity can be easily bounded in some cases: for instance, if $\Theta \subset \mathbb{R}^d$ and $\mathcal{G} = \mathcal{G}_k(\Theta)$, then by Proposition 1, under strong identifiability, $C_k(\mathcal{G}, r) \ge c(k, G_0) r^4$. In addition, under the assumption that for all $G \in \mathcal{G}$ both the masses of all $k$ support points of $G$ and their pairwise distances are uniformly bounded away from 0, it can be shown that $M(\mathcal{G}, G_1, r)$ is bounded by a constant that depends only on $G_0$, $\Theta$ and $k$. (Further details are elaborated in the proof of Theorem 5.)

Next, the existence of tests can be shown for discriminating $G_0$ against the complement of a closed Wasserstein ball:

Lemma 3. Assume that the conditions of Lemma 2 hold, and define $M(\mathcal{G}, G_1, r)$ as in Lemma 2. Suppose that for some non-increasing function $D(\epsilon)$, some $\epsilon_n \ge 0$ and every $\epsilon > \epsilon_n$,

$$\sup_{G_1 \in \mathcal{G}} M(\mathcal{G}, G_1, \epsilon) \times D(\epsilon/2,\; \mathcal{G} \cap B_W(G_0, 2\epsilon) \setminus B_W(G_0, \epsilon),\; W_2) \le D(\epsilon). \qquad (9)$$

Then, for every $\epsilon > \epsilon_n$ and any $t_0 \in \mathbb{N}$, there exist tests $\varphi_n$ (depending on $\epsilon > 0$) such that

$$P_{G_0}\varphi_n \le D(\epsilon) \sum_{t=t_0}^{\lceil \mathrm{Diam}(\Theta)/\epsilon \rceil} \exp[-nC_k(\mathcal{G}, t\epsilon)/8], \qquad (10)$$

$$\sup_{G \in \mathcal{G}:\, W_2(G_0, G) > t_0\epsilon} P_G(1 - \varphi_n) \le \exp[-nC_k(\mathcal{G}, t_0\epsilon)/8]. \qquad (11)$$

Remark. It is interesting to observe that the function $D(\epsilon)$ is used to control the packing number of thin layers of Wasserstein balls (a similar quantity also arises via the peeling argument in [13], Theorem 7.1), in addition to the packing number $M$ of small Wasserstein balls in terms of smaller Wasserstein balls whose radii are specified by the Hellinger information function. As in the previous lemma, the latter packing number appears intrinsic to the analysis of convergence for mixing measures.

The preceding two lemmas provide the core argument for establishing the following general posterior contraction theorems for latent mixing measures in a mixture model. These theorems involve three types of conditions. The first concerns the size of the support of $\Pi$, often quantified in terms of its entropy number; estimates of entropy numbers defined in terms of Wasserstein metrics for several measure classes of interest are given in Lemma 4. The second condition is on the Kullback-Leibler support of $\Pi$, which is related to both the space of discrete measures $\bar{\mathcal{G}}(\Theta)$ and the family of likelihood functions $f(x|\theta)$. The Kullback-Leibler neighborhood is defined as:

$$B_K(\epsilon) = \left\{ G \in \bar{\mathcal{G}}(\Theta) : -P_{G_0}\Big(\log \frac{p_G}{p_{G_0}}\Big) \le \epsilon^2,\; P_{G_0}\Big(\log \frac{p_G}{p_{G_0}}\Big)^2 \le \epsilon^2 \right\}. \qquad (12)$$

The third type of condition is on the Hellinger information of the $W_2$ metric, the function $C_k(\mathcal{G}, r)$, a characterization of which was given above.

Theorem 3. Let $G_0 \in \mathcal{G}_k(\Theta) \subseteq \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$, and suppose the family of likelihood functions is finitely identifiable. Suppose that for a sequence $(\epsilon_n)_{n \ge 1}$ tending to a constant (or 0) such that $n\epsilon_n^2 \to \infty$, a constant $C > 0$, and convex sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$, we have:

$$\log D(\epsilon_n, \mathcal{G}_n, W_2) \le n\epsilon_n^2, \qquad (13)$$

$$\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) \le \exp[-n\epsilon_n^2(C+4)], \qquad (14)$$

$$\Pi(B_K(\epsilon_n)) \ge \exp(-n\epsilon_n^2 C). \qquad (15)$$

Moreover, suppose $M_n$ is a sequence such that

$$C_k(\mathcal{G}_n, M_n\epsilon_n) \ge 8\epsilon_n^2(C+4), \qquad (16)$$

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nC_k(\mathcal{G}_n, j\epsilon_n)/8] \to 0. \qquad (17)$$

Then $\Pi(G : W_2(G_0, G) \ge M_n\epsilon_n | X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

The following theorem uses a weaker condition on the covering number, but it contains an additional condition on the likelihood functions which may be useful for handling the case of non-convex sieves $\mathcal{G}_n$:

Theorem 4. Let $G_0 \in \mathcal{G}_k(\Theta) \subseteq \bar{\mathcal{G}}(\Theta)$ for some $k < \infty$. Assume the following:

(a) The family of likelihood functions is finitely identifiable and satisfies $h(f_i, f'_j) \le C_1\rho^\alpha(\theta_i, \theta'_j)$ for any likelihood functions $f_i, f'_j$ in the family, for some constants $C_1 > 0$, $\alpha \ge 1$.

(b) There is a sequence of sets $\mathcal{G}_n \subset \bar{\mathcal{G}}(\Theta)$ for which $M(\mathcal{G}_n, G_1, \epsilon)$ is defined by Eq. (6).

(c) There is a sequence $\epsilon_n \to 0$ such that $n\epsilon_n^2$ is bounded away from 0 or tends to infinity, and a sequence $M_n$, such that

$$\log D(\epsilon/2,\; \mathcal{G}_n \cap B_W(G_0, 2\epsilon) \setminus B_W(G_0, \epsilon),\; W_2) + \sup_{G_1 \in \mathcal{G}_n} \log M(\mathcal{G}_n, G_1, \epsilon) \le n\epsilon_n^2 \quad \forall \epsilon \ge \epsilon_n, \qquad (18)$$

$$\frac{\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n)}{\Pi(B_K(\epsilon_n))} = o(\exp(-2n\epsilon_n^2)), \qquad (19)$$

$$\frac{\Pi(B_W(G_0, 2j\epsilon_n) \setminus B_W(G_0, j\epsilon_n))}{\Pi(B_K(\epsilon_n))} \le \exp[nC_k(\mathcal{G}_n, j\epsilon_n)/16] \quad \forall j \ge M_n, \qquad (20)$$

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nC_k(\mathcal{G}_n, j\epsilon_n)/16] \to 0. \qquad (21)$$

Then we have that $\Pi(G : W_2(G_0, G) \ge M_n\epsilon_n | X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability.

Before moving to specific examples, we state a simple lemma which provides estimates of the entropy under the $d_\rho$ metric for a number of classes of discrete measures of interest. Because $d_\rho$ inherits directly the $\rho$ metric on $\Theta$, the entropy for classes in $(\bar{\mathcal{G}}(\Theta), d_\rho)$ can typically be bounded in terms of the covering number for subsets of $(\Theta, \rho)$. Since $d_\rho \equiv W_1$ is weaker than the $L_2$ Wasserstein metric $W_2$, the following bounds are immediately applicable if $d_\rho$ is replaced by $W_2$.

Lemma 4. (a) $\log N(2\epsilon, \mathcal{G}_k(\Theta), d_\rho) \le k(\log N(\epsilon, \Theta, \rho) + \log(e + e\mathrm{Diam}(\Theta)/\epsilon))$.

(b) $\log N(2\epsilon, \bar{\mathcal{G}}(\Theta), d_\rho) \le N(\epsilon, \Theta, \rho)\log(e + e\mathrm{Diam}(\Theta)/\epsilon)$.

(c) Let $G_0 = \sum_{i=1}^{k} p^*_i \delta_{\theta^*_i} \in \mathcal{G}_k(\Theta)$. Assume that $M = \max_{i=1}^{k} 1/p^*_i < \infty$ and $m = \min_{i \ne j \le k} \rho(\theta^*_i, \theta^*_j) > 0$. Then

$$\log N(\epsilon/2, \{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le k\Big( \sup_{\Theta'} \log N(\epsilon/4, \Theta', \rho) + \log(32k\mathrm{Diam}(\Theta)/m) \Big),$$

where the supremum on the right side is taken over all bounded subsets $\Theta' \subseteq \Theta$ such that $\mathrm{Diam}(\Theta') \le 4M\epsilon$.

4 Examples

In this section we derive posterior contraction rates for two classes of mixture models: finite mixtures of multivariate distributions, and infinite mixtures based on the Dirichlet process.

4.1 Finite mixtures of multivariate distributions

Let $\Theta$ be a subset of $\mathbb{R}^d$, let $\rho$ be the Euclidean metric, and let $\Pi$ be a prior distribution for discrete measures in $\mathcal{G}_k(\Theta)$, where $k < \infty$ is known. Suppose that the "truth" is $G_0 = \sum_{i=1}^{k} p^*_i \delta_{\theta^*_i} \in \mathcal{G}_k(\Theta)$. To obtain the convergence rate of the posterior distribution of $G$, we need:

Assumptions A.

(A1) $\Theta$ is compact and the family of likelihood functions $f(\cdot|\theta)$ is strongly identifiable.

(A2) For some positive constant $C_1$, $K(f_i, f'_j) \le C_1\|\theta_i - \theta'_j\|^2$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $\int p_{G_0}(\log(p_{G_0}/p_G))^2\,d\mu < C_2 K(p_{G_0}, p_G)$ for some constant $C_2 > 0$.

(A3) Under the prior $\Pi$, for small $\delta > 0$, $c_3\delta^k \le \Pi(|p_i - p^*_i| \le \delta,\, i = 1, \ldots, k) \le C_3\delta^k$ and $c_3\delta^{kd} \le \Pi(\|\theta_i - \theta^*_i\| \le \delta,\, i = 1, \ldots, k) \le C_3\delta^{kd}$ for some constants $c_3, C_3 > 0$.

(A4) Under the prior $\Pi$, all $p_i$ are bounded away from 0, and all pairwise distances $\|\theta_i - \theta_j\|$ are bounded away from 0.

Remarks. (A1) and (A2) hold for the family of Gaussian densities with mean parameter $\theta$. (A3) holds when the prior distribution on the relevant parameters behaves like a uniform distribution, up to a multiplicative constant.

Theorem 5. Under Assumptions (A1)–(A4), the contraction rate in the $L_2$ Wasserstein distance metric of the posterior distribution of $G$ is $n^{-1/4}$.

Proof. Let $G = \sum_{i=1}^{k} p_i\delta_{\theta_i}$. Combining Lemma 1 with Assumption (A2), if $\|\theta_i - \theta^*_i\| \le \epsilon$ and $|p_i - p^*_i| \le \epsilon^2/(k\mathrm{Diam}(\Theta)^2)$ for $i = 1, \ldots, k$, then $K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 \sum_{1 \le i,j \le k} q_{ij}\|\theta^*_i - \theta_j\|^2$ for any $q \in \mathcal{Q}$. Thus

$$K(p_{G_0}, p_G) \le C_1 W_2^2(G_0, G) \le C_1 \sum_{i=1}^{k} (p^*_i \wedge p_i)\|\theta^*_i - \theta_i\|^2 + C_1 \sum_{i=1}^{k} |p_i - p^*_i|\,\mathrm{Diam}(\Theta)^2 \le 2C_1\epsilon^2.$$

Hence, under the prior $\Pi$,

$$\Pi(G : K(p_{G_0}, p_G) \le \epsilon^2) \ge \Pi(G : \|\theta_i - \theta^*_i\| \le \epsilon,\; |p_i - p^*_i| \le \epsilon^2/(k\mathrm{Diam}(\Theta)^2),\; i = 1, \ldots, k).$$

In view of Assumptions (A2) and (A3), we have $\Pi(B_K(\epsilon)) \gtrsim \epsilon^{k(d+2)}$. Conversely, for sufficiently small $\epsilon$, if $W_2(G_0, G) \le \epsilon$ then, by reordering the indices of the atoms, we must have $\|\theta_i - \theta^*_i\| = O(\epsilon)$ and $|p_i - p^*_i| = O(\epsilon^2)$ for all $i = 1, \ldots, k$ (see the argument in the proof of Lemma 4(c)). This entails that, under the prior $\Pi$,

$$\Pi(G : W_2^2(G_0, G) \le \epsilon^2) \le \Pi(G : \|\theta_i - \theta^*_i\| \le O(\epsilon),\; |p_i - p^*_i| \le O(\epsilon^2),\; i = 1, \ldots, k) \lesssim \epsilon^{k(d+2)}.$$

Let $\epsilon_n = n^{-1/2}$. We proceed by verifying the conditions of Theorem 4. Let $\mathcal{G}_n := \mathcal{G}_k(\Theta)$. Then $\Pi(\bar{\mathcal{G}}(\Theta) \setminus \mathcal{G}_n) = 0$, so Eq. (19) trivially holds.

Next, we show that both $D(\epsilon/2, S, W_2)$, where $S = \{G \in \mathcal{G}_n : W_2(G_0, G) \le 2\epsilon\}$, and $M(\mathcal{G}_n, G_1, \epsilon)$ are bounded above by a constant, so that (18) is satisfied. Indeed, for any $\epsilon > 0$, $\log D(\epsilon/2, S, W_2) \le \log N(\epsilon/4, S, W_2) \le \log N(\epsilon/4, S, W_1)$. By Lemma 4(c), $\log N(\epsilon/4, S, W_1)$ is bounded in terms of $\sup_{\Theta'} \log N(\epsilon/8, \Theta', \rho)$, which is bounded above by a constant when the sets $\Theta'$ are subsets of $\Theta$ whose diameter is bounded by a multiple of $\epsilon$. A similar argument holds for $M(\mathcal{G}_n, G_1, \epsilon)$: due to strong identifiability and Assumption (A2), $C_k(\mathcal{G}_n, \epsilon) \ge c\epsilon^4$ for some constant $c > 0$; by Assumption (A4) and an application of Lemma 4(c), for some constant $c_1 > 0$, $M(\mathcal{G}_n, G_1, \epsilon) \le N(c_1\epsilon, \mathcal{G}_k(\Theta) \cap B_W(G_1, \epsilon/2), W_2)$ is bounded by a constant. Thus Eq. (18) holds.

By Proposition 1(b) and Assumption (A4), $C_k(\mathcal{G}_n, j\epsilon_n) = \inf_{W_2(G_0, G) \ge j\epsilon_n/2} h^2(p_{G_0}, p_G) \ge c(j\epsilon_n)^4$ for some constant $c > 0$. To ensure condition (21), note that (the constant $c$ changes after each bounding step)

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nC_k(\mathcal{G}_n, j\epsilon_n)/16] \lesssim \exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nc(j\epsilon_n)^4] \lesssim \exp(2n\epsilon_n^2 - ncM_n^4\epsilon_n^4).$$

This upper bound goes to zero if $ncM_n^4\epsilon_n^4 \ge 4n\epsilon_n^2$, which is satisfied by taking $M_n$ to be a large multiple of $\epsilon_n^{-1/2}$. Thus we obtain $M_n\epsilon_n \asymp \epsilon_n^{1/2} = n^{-1/4}$.

Under the assumptions specified above, $\Pi(G : j\epsilon_n < W_2(G, G_0) \le 2j\epsilon_n)/\Pi(B_K(\epsilon_n)) = O(1)$. On the other hand, for $j \ge M_n$ we have $\exp[nC_k(\mathcal{G}_n, j\epsilon_n)/16] \ge \exp[nc(j\epsilon_n)^4/16]$, which is bounded below by an arbitrarily large constant by choosing $M_n$ to be a large multiple of $\epsilon_n^{-1/2}$, thereby ensuring (20).

Thus, by Theorem 4, the rate of contraction of the posterior distribution of $G$ under the $W_2$ distance metric is $n^{-1/4}$, which matches the minimax rate $n^{-1/4}$ proved in the univariate case by [5].

4.2 Dirichlet process mixtures

Suppose the "true" discrete measure is $G_0 = \sum_{i=1}^{k} p^*_i\delta_{\theta^*_i} \in \mathcal{G}_k(\Theta)$, where $\Theta$ is a metric space but $k \le \infty$ is unknown. To estimate $G_0$, the prior distribution $\Pi$ on the discrete measure $G \in \bar{\mathcal{G}}(\Theta)$ is taken to be a Dirichlet process $\mathrm{DP}(\nu, P_0)$ centered at $P_0$ with concentration parameter $\nu > 0$ [9]. Here the parameter $P_0$ is a probability measure on $\Theta$. For any $m \ge 1$, the following lemma provides a lower bound on small ball probabilities of the metric space $(\bar{\mathcal{G}}(\Theta), d_{\rho^m}^{1/m})$ in terms of small ball $P_0$-probabilities of the metric space $(\Theta, \rho)$.
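For intuition about this prior, a draw $G \sim \mathrm{DP}(\nu, P_0)$ can be simulated via Sethuraman's stick-breaking representation. The sketch below is illustrative, not from the paper: the uniform base measure and the truncation level are assumptions.

import numpy as np

def sample_dp_stick_breaking(nu, d=1, trunc=500, rng=None):
    """Truncated stick-breaking draw from DP(nu, P0), P0 = Uniform([0,1]^d)."""
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, nu, size=trunc)            # stick-breaking fractions
    # p_i = beta_i * prod_{l < i} (1 - beta_l)
    weights = betas * np.concatenate([[1.0], np.cumprod(1 - betas)[:-1]])
    atoms = rng.uniform(0.0, 1.0, size=(trunc, d))   # iid draws from P0
    return atoms, weights / weights.sum()            # renormalize the truncation

atoms, weights = sample_dp_stick_breaking(nu=2.0, d=2)
print(atoms[:3], weights[:3])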

Lemma 5. Let $G \sim \mathrm{DP}(\nu, P_0)$, where $P_0$ is a non-atomic base probability measure on a compact set $\Theta$. For small $\epsilon > 0$, let $D = D(\epsilon, \Theta, \rho)$ denote the packing number of $\Theta$ under the $\rho$ metric. Then, under the Dirichlet process distribution,

$$\Pi(G : d_{\rho^m}(G_0, G) \le (2^m + 1)\epsilon^m) \ge \Gamma(\nu)\big[\epsilon^m(2D)^{-1}\mathrm{Diam}(\Theta)^{-m}\big]^{D-1}\nu^D \prod_{i=1}^{D} P_0(S_i).$$

Here $(S_1, \ldots, S_D)$ denote the $D$ disjoint $\epsilon/2$-balls that form a maximal packing of $\Theta$, and $\Gamma(\cdot)$ is the gamma function.

Proof. Since every point in $\Theta$ is within distance $\epsilon$ of one of the centers of $S_1, \ldots, S_D$, there is a $D$-partition $(S'_1, \ldots, S'_D)$ of $\Theta$ such that $S_i \subseteq S'_i$ and $\mathrm{Diam}(S'_i) \le 2\epsilon$ for each $i = 1, \ldots, D$. Let $m_i = G(S'_i)$, $\mu_i = P_0(S'_i)$, and $\hat{p}_i = G_0(S'_i)$. From the definition of Dirichlet processes, $\mathbf{m} = (m_1, \ldots, m_D) \sim \mathrm{Dir}(\nu\mu_1, \ldots, \nu\mu_D)$. To obtain an upper bound on $d_{\rho^m}(G_0, G)$, consider a coupling between $G_0$ and $G$ obtained by associating $m_i \wedge \hat{p}_i$ probability mass of the supporting atoms of $G_0$ contained in the subset $S'_i$ with the same probability mass of the supporting atoms of $G$ contained in the same subset, for each $i = 1, \ldots, D$. The remaining mass (of probability $\|\mathbf{m} - \hat{p}\|_1$) for both measures is coupled in an arbitrary way. The expectation of the $\rho^m$ distance under this coupling provides one such upper bound, i.e.:

$$d_{\rho^m}(G_0, G) \le (2\epsilon)^m + \|\mathbf{m} - \hat{p}\|_1 [\mathrm{Diam}(\Theta)]^m.$$

Due to the non-atomicity of $P_0$, for $\epsilon$ sufficiently small, $\nu\mu_i \le 1$ for all $i = 1, \ldots, D$. Let $\delta = \epsilon/\mathrm{Diam}(\Theta)$. Then, under $\Pi$,

$$\Pr(d_{\rho^m}(G_0, G) \le (2^m+1)\epsilon^m) \ge \Pr(\|\mathbf{m} - \hat{p}\|_1 \le \delta^m) \ge \Pr(|m_i - \hat{p}_i| \le \delta^m/2D,\; i = 1, \ldots, D-1)$$

$$= \frac{\Gamma(\nu)}{\prod_{i=1}^{D}\Gamma(\nu\mu_i)} \int_{\Delta_{D-1} \cap \{|m_i - \hat{p}_i| \le \delta^m/2D\}} \prod_{i=1}^{D-1} m_i^{\nu\mu_i - 1}\Big(1 - \sum_{i=1}^{D-1} m_i\Big)^{\nu\mu_D - 1} dm_1 \cdots dm_{D-1}$$

$$\ge \frac{\Gamma(\nu)}{\prod_{i=1}^{D}\Gamma(\nu\mu_i)} \prod_{i=1}^{D-1} \int_{\max(\hat{p}_i - \delta^m/2D,\, 0)}^{\min(\hat{p}_i + \delta^m/2D,\, 1)} m_i^{\nu\mu_i - 1}\,dm_i \ge \Gamma(\nu)(\delta^m/2D)^{D-1} \prod_{i=1}^{D} (\nu\mu_i).$$

The second inequality in the previous display is due to the fact that $\|\mathbf{m} - \hat{p}\|_1 \le 2\sum_{i=1}^{D-1} |m_i - \hat{p}_i|$. The third inequality holds because $(1 - \sum_{i=1}^{D-1} m_i)^{\nu\mu_D - 1} = m_D^{\nu\mu_D - 1} \ge 1$, since $\nu\mu_D \le 1$ and $0 < m_D < 1$ almost surely. The last inequality is due to the fact that $\Gamma(\alpha) \le 1/\alpha$ for $0 < \alpha \le 1$. This gives the desired claim.

Assumptions B.

(B1) The non-atomic base measure $P_0$ places full support on a compact set $\Theta \subset \mathbb{R}^d$. Moreover, $P_0$ has a Lebesgue density that is bounded away from zero.

(B2) For some constants $C_1, m_1 > 0$, $K(f_i, f'_j) \le C_1\rho^{m_1}(\theta_i, \theta'_j)$ for any $\theta_i, \theta'_j \in \Theta$. For any $G \in \mathrm{supp}(\Pi)$, $\int p_{G_0}(\log(p_{G_0}/p_G))^2\,d\mu \le C_2 K(p_{G_0}, p_G)^{m_2}$ for some constants $C_2, m_2 > 0$.

Theorem 6. Given Assumptions (B1) and (B2) and the smoothness conditions on the likelihood family specified in Theorem 2, there is a sequence $\beta_n \searrow 0$ such that $\Pi(W_2(G_0, G) \ge \beta_n | X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability. Specifically,

(1) for ordinary smooth likelihood functions, take $\beta_n \asymp (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d+\delta)}}$, for any small $\delta > 0$;

(2) for supersmooth likelihood functions, take $\beta_n \asymp (\log n)^{-1/\beta}$.

Proof. The proof consists of two main steps. First, we show that under Assumptions (B1)–(B2), the conditions specified by Eqs. (13), (14) and (15) in Theorem 3 are satisfied by taking $\mathcal{G}_n = \bar{\mathcal{G}}(\Theta)$, which is a convex set, and $\epsilon_n$ to be a large multiple of $(\log n/n)^{1/(d+2)}$. The second step constructs sequences $M_n$ and $\beta_n = M_n\epsilon_n$ for which Theorem 3 can be applied.

Step 1. By Lemma 1 and (B2), $K(p_{G_0}, p_G) \le d_{\rho_K}(G_0, G) \le C_1 d_{\rho^{m_1}}(G_0, G)$. Also, $\int p_{G_0}[\log(p_{G_0}/p_G)]^2\,d\mu \lesssim d_{\rho^{m_1}}(G_0, G)^{m_2}$. Assume without loss of generality that $m_1 \le m_2$ (the other direction is handled similarly). We obtain that $\Pi(G \in B_K(\epsilon_n)) \ge \Pi(G : d_{\rho^{m_1}}(G_0, G) \le C_3\epsilon_n^{2 \vee 2/m_2})$ for some constant $C_3 > 0$.

From (B1), there is a universal constant $c_3 > 0$ such that for any $\epsilon$ and any $D$-partition $(S_1, \ldots, S_D)$ as specified in Lemma 5, there holds

$$\log \prod_{i=1}^{D} P_0(S_i) \ge c_3 D\log(1/D).$$

Moreover, the packing number satisfies $D \asymp [\mathrm{Diam}(\Theta)/\epsilon_n]^d$. Combining these facts with Lemma 5, we have $\log \Pi(G \in B_K(\epsilon_n)) \gtrsim (D-1)\log(\epsilon_n/\mathrm{Diam}(\Theta)) + (2D-1)\log(1/D) + D\log\nu$, where the approximation constant depends on $m_1, m_2$. It is simple to check that condition (15), i.e., $\log \Pi(G \in B_K(\epsilon_n)) \ge -Cn\epsilon_n^2$, holds for the given rate of $\epsilon_n$ and any constant $C > 0$.

Since $\mathcal{G}_n = \bar{\mathcal{G}}(\Theta)$, (14) trivially holds. Turning to condition (13), by Lemma 4(b) we have $\log N(2\epsilon_n, \bar{\mathcal{G}}(\Theta), W_2) \le N(\epsilon_n, \Theta, \rho)\log(e + e\mathrm{Diam}(\Theta)/\epsilon_n) \le (\mathrm{Diam}(\Theta)/\epsilon_n)^d\log(e + e\mathrm{Diam}(\Theta)/\epsilon_n) \le n\epsilon_n^2$ by the specified rate of $\epsilon_n$.

Step 2. For any $\mathcal{G} \subseteq \bar{\mathcal{G}}(\Theta)$, let $R_k(\mathcal{G}, \cdot)$ be the inverse of the Hellinger information function of the $d_\rho$ metric. Specifically, for any $t \ge 0$,

$$R_k(\mathcal{G}, t) = \inf\{r \ge 0 \mid C_k(\mathcal{G}, r) \ge t\}.$$

Note that $R_k(\mathcal{G}, 0) = 0$, and $R_k(\mathcal{G}, \cdot)$ is non-decreasing because $C_k(\mathcal{G}, \cdot)$ is.

Let $(\epsilon_n)_{n \ge 1}$ be the sequence determined in the previous step of the proof. Let $M_n = R_k(\bar{\mathcal{G}}(\Theta), 8\epsilon_n^2(C+4))/\epsilon_n$ and $\beta_n = M_n\epsilon_n = R_k(\bar{\mathcal{G}}(\Theta), 8\epsilon_n^2(C+4))$. Condition (16) holds by the definition of $R_k$, i.e., $C_k(\bar{\mathcal{G}}(\Theta), M_n\epsilon_n) \ge 8\epsilon_n^2(C+4)$. To verify (17), note that the sum over $j$ cannot have more than $\mathrm{Diam}(\Theta)/\epsilon_n$ terms, and due to the monotonicity of $C_k$ we have

$$\exp(2n\epsilon_n^2) \sum_{j \ge M_n} \exp[-nC_k(\mathcal{G}_n, j\epsilon_n)/8] \le (\mathrm{Diam}(\Theta)/\epsilon_n)\exp(2n\epsilon_n^2 - nC_k(\mathcal{G}_n, M_n\epsilon_n)/8) \to 0.$$

Hence Theorem 3 can be applied to conclude that $\Pi(d_\rho(G_0, G) \ge \beta_n | X_1, \ldots, X_n) \to 0$ in $P_{G_0}$-probability. Under the ordinary smoothness condition (as specified in Theorem 2), $R_k(\bar{\mathcal{G}}(\Theta), t) = t^{\frac{1}{4+(2\beta+1)d+\delta}}$, where $\delta$ is an arbitrarily small positive constant. So $\beta_n \asymp \epsilon_n^{\frac{2}{4+(2\beta+1)d+\delta}} = (\log n/n)^{\frac{2}{(d+2)(4+(2\beta+1)d+\delta)}}$. On the other hand, under the supersmoothness condition, $R_k(\bar{\mathcal{G}}(\Theta), t) = (1/\log(1/t))^{1/\beta}$, so $\beta_n \asymp (\log(1/\epsilon_n))^{-1/\beta} \asymp (\log n)^{-1/\beta}$.

5 Proofs of main results

5.1 Identifiability results

Proof of Theorem 1.

Proof. Suppose that Eq. (4) is not true. Then there exist sequences $G_n$ and $G'_n$ tending to $G_0$ in the $W_2$ metric such that $\psi(G_n, G'_n) \to 0$. We write $G_n = \sum_{i=1}^{\infty} p_{n,i}\delta_{\theta_{n,i}}$, where $p_{n,i} = 0$ for indices $i$ greater than $k_n$, the number of atoms of $G_n$; similar notation applies to $G'_n$. Since both $G_n$ and $G'_n$ have a finite number of atoms, there is $q^{(n)} \in \mathcal{Q}(p_n, p'_n)$ such that $W_2^2(G_n, G'_n) = \sum_{i,j} q^{(n)}_{ij}\|\theta_{n,i} - \theta'_{n,j}\|^2$.

Let $O_n = \{(i,j) : \|\theta_{n,i} - \theta'_{n,j}\| \le W_2(G_n, G'_n)\}$. (Pairs outside $O_n$ must be considered because there can be pairs of atoms $\theta_{n,i}, \theta'_{n,j}$ with $\|\theta_{n,i} - \theta'_{n,j}\|$ bounded away from zero in the limit.) Since $q^{(n)} \in \mathcal{Q}(p_n, p'_n)$, we can express

$$\psi(G_n, G'_n) = \sup_x \Big| \sum_{i=1}^{k_n} p_{n,i} f(x|\theta_{n,i}) - \sum_{j=1}^{k'_n} p'_{n,j} f(x|\theta'_{n,j}) \Big| \Big/ W_2^2(G_n, G'_n) = \sup_x \Big| \sum_{i,j} q^{(n)}_{ij}\big(f(x|\theta_{n,i}) - f(x|\theta'_{n,j})\big) \Big| \Big/ W_2^2(G_n, G'_n),$$

and, by Taylor's expansion,

$$\psi(G_n, G'_n) = \sup_x \Big| \sum_{(i,j) \notin O_n} q^{(n)}_{ij}\big(f(x|\theta'_{n,j}) - f(x|\theta_{n,i})\big) + \sum_{(i,j) \in O_n} q^{(n)}_{ij}(\theta'_{n,j} - \theta_{n,i})^T Df(x|\theta_{n,i}) + \sum_{(i,j) \in O_n} q^{(n)}_{ij}(\theta'_{n,j} - \theta_{n,i})^T D^2 f(x|\theta_{n,i})(\theta'_{n,j} - \theta_{n,i}) + R_n(x) \Big| \Big/ W_2^2(G_n, G'_n) =: \sup_x |A_n(x) + B_n(x) + C_n(x) + R_n(x)|/D_n,$$

where

$$R_n(x) = O\Big( \sum_{(i,j) \in O_n} q^{(n)}_{ij}\|\theta_{n,i} - \theta'_{n,j}\|^{2+\delta} \Big) = o\Big( \sum_{(i,j) \in O_n} q^{(n)}_{ij}\|\theta_{n,i} - \theta'_{n,j}\|^2 \Big)$$

due to Eq. (3) and the definition of $O_n$. So $R_n(x)/W_2^2(G_n, G'_n) \to 0$.

    due to Eq. (3) and the definition ofOn. SoRn(x)/W 22 (Gn, G′n) → 0.The quantitiesAn(x), Bn(x) andCn(x) are linear combinations of elements off(x|θ),

    Df(x|θ) andD2f(x|θ) for differentθ’s, respectively. SinceΘ is compact, subsequencesof Gn andG′n can be chosen so that each of their support points converges to a fixedatom

    17

  • θ∗l , for l = 1, . . . , k∗ ≤ k. After being rescaled, the limits ofAn(x)/Dn, Bn(x)/Dn and

    Cn(x)/Dn are still linear combinations with constant coefficients not depending onx.We shall now argue that not all such coefficients vanish to zero. Suppose this is not the

    case. It follows that for the coefficients ofCn(x)/Dn we have:

    (i,j)∈Onq(n)ij ‖θ′n,j − θn,i‖2/W 22 (Gn, G′n) → 0.

    This implies that∑

    (i,j) 6∈On q(n)ij ‖θ′n,j − θn,i‖2/W 22 (Gn, G′n) → 1. SinceΘ is bounded,

    there exists a pair(i, j) 6∈ On such thatqnij/W 22 (Gn, G′n) does not vanish to zero. Butthen, one of the coefficients ofAn(x)/Dn does not vanish to zero, which contradicts thehypothesis.

Next, we observe that some of the coefficients of $A_n(x)/D_n$, $B_n(x)/D_n$ and $C_n(x)/D_n$ may tend to infinity. For each $n$, let $d_n$ be the inverse of the maximum coefficient of $A_n(x)/D_n$, $B_n(x)/D_n$ and $C_n(x)/D_n$. By the conclusion of the preceding paragraph, $|d_n|$ is uniformly bounded from above by a constant for all $n$. Moreover, $d_n A_n(x)/D_n$ converges to $\sum_{j=1}^{k^*} \alpha_j f(x|\theta^*_j)$, $d_n B_n(x)/D_n$ converges to $\sum_{j=1}^{k^*} \beta_j^T Df(x|\theta^*_j)$, and $d_n C_n(x)/D_n$ converges to $\sum_{j=1}^{k^*} \gamma_j^T D^2 f(x|\theta^*_j)\gamma_j$, for some finite $\alpha_j$, $\beta_j$ and $\gamma_j$, not all of them vanishing (in fact, at least one of them is 1). We have

$$d_n |p_{G_n}(x) - p_{G'_n}(x)| / d_{\rho^2}(G_n, G'_n) \to \Big| \sum_{j=1}^{k^*} \alpha_j f(x|\theta^*_j) + \beta_j^T Df(x|\theta^*_j) + \gamma_j^T D^2 f(x|\theta^*_j)\gamma_j \Big| \qquad (22)$$

for all $x$. This entails that the right side of the preceding display must be 0 for almost all $x$. By strong identifiability, all coefficients must be 0, which leads to a contradiction.

With respect to $\psi_1(G, G')$, suppose that the claim is not true. This implies the existence of subsequences $G_n, G'_n$ such that $\psi_1(G_n, G'_n) \to 0$. Going through the same argument as above, we obtain $\alpha_j, \beta_j, \gamma_j$, not all of which are zero, such that Eq. (22) holds. An application of Fatou's lemma yields $\int |\sum_{j=1}^{k^*} \alpha_j f(x|\theta^*_j) + \beta_j^T Df(x|\theta^*_j) + \gamma_j^T D^2 f(x|\theta^*_j)\gamma_j|\,d\mu = 0$. Thus the integrand must be 0 for almost all $x$, leading to a contradiction.

Proof of Theorem 2.

Proof. To obtain an upper bound on $W_2^2(G, G') = d_{\rho^2}(G, G')$ in terms of $V(p_G, p_{G'})$ under the condition that $V(p_G, p_{G'}) \to 0$, our strategy is to approximate $G$ and $G'$ by convolving them with a mollifier $K_\delta$. By the triangle inequality, $W_2(G, G') \le W_2(G, G * K_\delta) + W_2(G', G' * K_\delta) + W_2(G * K_\delta, G' * K_\delta)$. The first two terms are simple to bound, while the last term can be handled by expressing $G * K_\delta$ as the convolution of the mixture density $p_G$ with another function. We also need the following elementary lemma (whose proof is given in Section 5.3).

Lemma 6. Assume that $p$ and $p'$ are two probability density functions on $\mathbb{R}^d$ with bounded $s$-th moments.

(a) For $t$ such that $0 < t < s$,

$$\int |p(x) - p'(x)|\,\|x\|^t\,dx \le 2\|p - p'\|_{L_1}^{(s-t)/s}\,(E_p\|X\|^s + E_{p'}\|X\|^s)^{t/s}.$$

(b) Let $V_d = \pi^{d/2}/\Gamma(d/2 + 1)$ denote the volume of the $d$-dimensional unit sphere. Then

$$\|p - p'\|_{L_1} \le 2V_d^{s/(d+2s)}(E_p\|X\|^s + E_{p'}\|X\|^s)^{\frac{d}{d+2s}}\,\|p - p'\|_{L_2}^{\frac{2s}{d+2s}}.$$

Take any $s > 0$, and let $K: \mathbb{R}^d \to (0, \infty)$ be a symmetric density function on $\mathbb{R}^d$ whose Fourier transform $\tilde{K}$ is a continuous function with support contained in $[-1, 1]^d$; moreover, $K$ has bounded moments up to order $s$. Consider the mollifiers $K_\delta(x) = \frac{1}{\delta^d}K(x/\delta)$ for $\delta > 0$. Let $\tilde{K}_\delta$ and $\tilde{f}$ be the Fourier transforms of $K_\delta$ and $f$, respectively. Define $g_\delta$ to be the inverse Fourier transform of $\tilde{K}_\delta/\tilde{f}$:

$$g_\delta(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{i\langle \omega, x \rangle}\,\frac{\tilde{K}_\delta(\omega)}{\tilde{f}(\omega)}\,d\omega.$$

Note that the function $\tilde{K}_\delta(\omega)/\tilde{f}(\omega)$ has bounded support. So $g_\delta \in L_1(\mathbb{R})$, and $\tilde{g}_\delta := \tilde{K}_\delta(\omega)/\tilde{f}(\omega)$ is the Fourier transform of $g_\delta$. By the convolution theorem, $f * g_\delta = K_\delta$. As a result,

$$G * K_\delta = G * f * g_\delta = p_G * g_\delta.$$

From the definition of $K_\delta$, the second moment under $K_\delta$ is $O(\delta^2)$. Consider a coupling of $G$ and $G * K_\delta$ under which we have a pair of random variables $(\theta, \theta + \epsilon)$, where $\epsilon$ is independent of $\theta$, and the marginal distributions of $\theta$ and $\epsilon$ are $G$ and $K_\delta$, respectively. Under this coupling, $E\|(\theta + \epsilon) - \theta\|^2 = O(\delta^2)$, which entails that $W_2^2(G, G * K_\delta) = O(\delta^2)$.

By the triangle inequality, $W_2(G, G') \le W_2(G * K_\delta, G' * K_\delta) + O(\delta)$, so for some constant $C(K) > 0$ depending only on the kernel $K$,

$$W_2^2(G, G') \le 2W_2^2(G * K_\delta, G' * K_\delta) + C(K)\delta^2. \qquad (23)$$
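The key identity $f * g_\delta = K_\delta$ (equivalently, $G * K_\delta = p_G * g_\delta$) can be checked numerically in $d = 1$. The sketch below is illustrative, not from the paper: it uses the Laplace density, whose Fourier transform is $\tilde{f}(\omega) = 1/(1+\omega^2)$, together with an assumed spectral window $\tilde{K}(\omega) = (1-\omega^2)_+^2$ supported on $[-1, 1]$; inverse transforms are computed by direct quadrature over the compact frequency support.

import numpy as np
from scipy.signal import fftconvolve

delta = 0.5
x = np.linspace(-15, 15, 4001); dx = x[1] - x[0]
w = np.linspace(-1/delta, 1/delta, 2001); dw = w[1] - w[0]

f = 0.5 * np.exp(-np.abs(x))                       # Laplace density
K_tilde = lambda u: np.clip(1 - u**2, 0, None)**2  # supported on [-1, 1]
g_tilde = K_tilde(delta * w) * (1 + w**2)          # = K_tilde_delta / f_tilde

phase = np.exp(1j * np.outer(x, w))                # e^{i w x} for inversion
g_delta = (phase @ g_tilde).real * dw / (2 * np.pi)
K_delta = (phase @ K_tilde(delta * w)).real * dw / (2 * np.pi)

conv = fftconvolve(f, g_delta, mode="same") * dx   # (f * g_delta)(x) on the grid
print(np.max(np.abs(conv - K_delta)))              # small discretization error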

Theorem 6.15 of [32] provides an upper bound for the Wasserstein distance: for any two probability measures $\mu$ and $\nu$, $W_2^2(\mu, \nu) \le 2\int \|x\|^2\,d|\mu - \nu|(x)$, where $|\mu - \nu|$ denotes the total variation of the measure $\mu - \nu$. Thus,

$$W_2^2(G * K_\delta, G' * K_\delta) \le 2\int \|x\|^2\,|G * K_\delta(x) - G' * K_\delta(x)|\,dx. \qquad (24)$$

We note that since the density function $K$ has a bounded $s$-th moment,

$$\int \|x\|^s\, G * K_\delta(dx) \le 2^s\Big[\int \|\theta\|^s\,dG(\theta) + \int \|x\|^s K_\delta(x)\,dx\Big] = 2^s\Big[\int \|\theta\|^s\,dG(\theta) + \delta^s \int \|x\|^s K(x)\,dx\Big] < \infty,$$

because the support points of $G$ lie in a compact set. Applying Lemma 6 to Eq. (24), we obtain that for $\delta < 1$,

$$W_2^2(G * K_\delta, G' * K_\delta) \le C(d, K, s)\,\|G * K_\delta - G' * K_\delta\|_{L_1}^{(s-2)/s} \le C(d, K, s)\,\|G * K_\delta - G' * K_\delta\|_{L_2}^{2(s-2)/(d+2s)}. \qquad (25)$$

Here the constants $C(d, K, s)$ are different in each line; they depend only on $d$, $s$ and the $s$-th moment of the density function $K$.

Next, we use the known fact that for an arbitrary (signed) measure $\mu$ on $\mathbb{R}^d$ and a function $g \in L_2(\mathbb{R}^d)$, there holds $\|\mu * g\|_{L_2} \le |\mu|\,\|g\|_{L_2}$, where $|\mu|$ denotes the total variation of $\mu$:

$$\|G * K_\delta - G' * K_\delta\|_{L_2} = \|p_G * g_\delta - p_{G'} * g_\delta\|_{L_2} = \|(p_G - p_{G'}) * g_\delta\|_{L_2} \le 2V(p_G, p_{G'})\,\|g_\delta\|_{L_2}. \qquad (26)$$

By Plancherel's identity,

$$\|g_\delta\|_{L_2}^2 = \frac{1}{(2\pi)^d}\int \frac{\tilde{K}_\delta(\omega)^2}{\tilde{f}(\omega)^2}\,d\omega = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} \frac{\tilde{K}(\omega\delta)^2}{\tilde{f}(\omega)^2}\,d\omega \le C\int_{[-1/\delta,\,1/\delta]^d} \tilde{f}(\omega)^{-2}\,d\omega.$$

The last bound holds because $\tilde{K}$ has support in $[-1, 1]^d$ and is bounded by a constant.

    The last bound holds becausẽK has support in[−1, 1]d, and is bounded by a constant.Collecting Eqs.(23)(24)(25)(26) and the preceeding display, we have:

    W 22 (G, G′) ≤ C(d, K, s)

    {

    infδ∈(0,1)

    δ2 + V (pG, pG′)2(s−2)d+2s

    [∫

    [−1/δ,1/δ]df̃(ω)−2dω

    ]s−2d+2s

    }

    .

    If |f̃(ω)∏dj=1 |ωj |β | ≥ d0 asωj → ∞(j = 1, . . . , d) for some positive constantd0,then

    W 22 (G, G′) ≤ C(d, K, s, β)

    {

    infδ∈(0,1)

    δ2 + V (pG, pG′)2(s−2)d+2s (1/δ)

    (2β+1)d(s−2)d+2s

    }

    ≤ C(d, K, s, β)V (pG, pG′)4(s−2)

    2(d+2s)+(2β+1)d(s−2) .

    The exponent tends to4/(4 + (2β + 1)d) as s → ∞, we obtain thatW 22 (G, G′) ≤C(d, β, r)V (pG, pG′)

    r, for any constantr < 4/(4 + (2β + 1)d), asV (pG, pG′) → 0.If |f̃(ω)∏dj=1 exp(|ωj |β) ≥ d0 asωj → ∞(j = 1, . . . , d) for some positive constants

    β, d0, then

    W 22 (G, G′) ≤ C(d, K, s, β)

    {

    infδ∈(0,1)

    δ2 + V (pG, pG′)2(s−2)/(d+2s) exp−2dδ−β s − 2

    d + 2s

    }

    .

    Takingδ−β = −1d log V (pG, pG′), we obtain thatdρ2(G, G′) ≤ C(d, β)(− log V (pG, pG′))−2/β .

    20

Proof of Lemma 1.

Proof. We exploit the variational characterization of $f$-divergences [24]:

$$\rho_\phi(f_i, f'_j) = \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\,d\mu,$$

where the supremum is taken over all measurable functions $\varphi_{ij}$ defined on $\mathcal{X}$, and $\phi^*$ denotes the Legendre-Fenchel conjugate dual of the convex function $\phi$. ($\phi^*$ is again a convex function on $\mathbb{R}$, defined by $\phi^*(v) = \sup_{u \in \mathbb{R}}(uv - \phi(u))$.) For any $q \in \mathcal{Q}(p, p')$,

$$\rho_\phi(p_G, p_{G'}) = \sup_\varphi \int \varphi p_{G'} - \phi^*(\varphi) p_G\,d\mu = \sup_\varphi \int \varphi \sum_j p'_j f'_j - \phi^*(\varphi)\sum_i p_i f_i\,d\mu$$

$$= \sup_\varphi \int \varphi \sum_{ij} q_{ij} f'_j - \phi^*(\varphi)\sum_{ij} q_{ij} f_i\,d\mu = \sup_\varphi \sum_{ij} q_{ij} \int \varphi f'_j - \phi^*(\varphi) f_i\,d\mu$$

$$\le \sum_{ij} q_{ij} \sup_{\varphi_{ij}} \int \varphi_{ij} f'_j - \phi^*(\varphi_{ij}) f_i\,d\mu = \sum_{ij} q_{ij}\,\rho_\phi(f_i, f'_j),$$

where the inequality holds because the supremum is taken over a larger set of functions. Moreover, the bound holds for any $q \in \mathcal{Q}(p, p')$, so $\rho_\phi(p_G, p_{G'}) \le d_{\rho_\phi}(G, G')$.

Remark. The anonymous referee suggested a shorter proof: by the variational characterization, $\rho_\phi$ is a convex functional (jointly in its two arguments). Thus, for any coupling $Q$ of the two mixing measures $G$ and $G'$,

$$\rho_\phi(p_G, p_{G'}) = \rho_\phi\Big(\int f(\cdot|\theta)\,dG,\; \int f(\cdot|\theta')\,dG'\Big) = \rho_\phi\Big(\int f(\cdot|\theta)\,dQ,\; \int f(\cdot|\theta')\,dQ\Big) \le \int \rho_\phi(f(\cdot|\theta), f(\cdot|\theta'))\,dQ,$$

where the inequality is obtained via Jensen's inequality. Since this holds for any $Q$, the desired bound follows.

Proof of Proposition 1.

Proof. (a) Suppose that the claim is not true. Then there is a sequence of pairs $(G_0, G) \in \mathcal{G}_k(\Theta) \times \mathcal{G}$ such that $W_2(G_0, G) \ge r/2 > 0$ always holds and $h(p_{G_0}, p_G) \to 0$, and which converges in the $W_2$ metric to a pair $G^*_0 \in \mathcal{G}_k(\Theta)$ and $G^* \in \mathcal{G}$, respectively; this is due to the compactness of both $\mathcal{G}_k(\Theta)$ and $\mathcal{G}$. We must have $W_2(G^*_0, G^*) \ge r/2 > 0$, so $G^*_0 \ne G^*$. At the same time, $h(p_{G^*_0}, p_{G^*}) = 0$, which implies that $p_{G^*_0} = p_{G^*}$ for almost all $x \in \mathcal{X}$. By the finite identifiability condition, $G^*_0 = G^*$, which is a contradiction.

(b) is an immediate consequence of Theorem 1, by noting that under the given hypothesis there is $c(k, G_0) > 0$ such that $h^2(p_{G_0}, p_G) \ge V^2(p_{G_0}, p_G)/2 \ge c(k, G_0) W_2^4(G_0, G)$ for sufficiently small $W_2(G_0, G)$. The boundedness of $\Theta$ implies the boundedness of $W_2(G_0, G)$, thereby extending the claim to the entire admissible range of $W_2(G_0, G)$. (c) is obtained in a similar way from Theorem 2.

5.2 Proofs of Theorems 3 and 4

We outline in this section the proofs of Theorems 3 and 4. Our proof follows the same steps as in [13], with suitable modifications for the inclusion of the Hellinger information function and the conditions involving latent mixing measures. The proof consists of results on the existence of tests, which are turned into probability bounds on posterior contraction.

Proof of Lemma 2.

Proof. We first consider Case (I). Define $\mathcal{P}_1 = \{p_G \mid G \in \mathcal{G} \cap B_W(G_1, r/2)\}$. Since $\rho$ is a metric on $\Theta$, it is a standard fact of Wasserstein metrics that $B_W(G_1, r/2)$ is a convex set. Since $\mathcal{G}$ is also convex, so is the set $\mathcal{G} \cap B_W(G_1, r/2)$. It follows that $\mathcal{P}_1$ is a convex set of mixture distributions. Now, applying a result of Birgé [3] and Le Cam ([19], Lemma 4, pg. 478), there exist tests $\varphi_n$ that discriminate between $P_{G_0}$ and the convex set $\mathcal{P}_1$ such that:

$$P_{G_0}\varphi_n \le \exp[-n\inf h^2(P_{G_0}, \mathcal{P}_1)/2], \qquad (27)$$

$$\sup_{G \in \mathcal{G} \cap B_W(G_1, r/2)} P_G(1 - \varphi_n) \le \exp[-n\inf h^2(P_{G_0}, \mathcal{P}_1)/2], \qquad (28)$$

where the exponents in the upper bounds are given by the infimum of the Hellinger distance over all $P_1 \in \mathrm{conv}\,\mathcal{P}_1 = \mathcal{P}_1$. Due to the triangle inequality, if $W_2(G_0, G_1) = r$ and $W_2(G_1, G) \le r/2$ then $W_2(G_0, G) \ge r/2$. So $C_k(\mathcal{G}, r) = \inf_{G \in \mathcal{G}:\, W_2(G_0, G) \ge r/2} h^2(p_{G_0}, p_G) \le \inf h^2(p_{G_0}, \mathcal{P}_1)$. This completes the proof of Case (I).

Turning to Case (II), for a constant $c_0 > 0$ to be determined, consider a maximal $c_0 r$-packing in the $W_2$ metric of the set $\mathcal{G} \cap B_W(G_1, r/2)$. This yields a set of $M(\mathcal{G}, G_1, r) = D(c_0 r, \mathcal{G} \cap B_W(G_1, r/2), W_2)$ points $\tilde{G}_1, \ldots, \tilde{G}_M \in \mathcal{G} \cap B_W(G_1, r/2)$. (In the following we drop the subscripts of $M(\cdot)$.)

We note the following fact: for any $t = 1, \ldots, M$, if $G \in \mathcal{G} \cap B_W(G_1, r/2)$ and $W_2(G, \tilde{G}_t) \le c_0 r$, then by Lemma 1 we have
$$h^2(p_G, p_{\tilde{G}_t}) \le d_{\rho_{h^2}}(G, \tilde{G}_t) \le C_1 d_{\rho^{2\alpha}}(G, \tilde{G}_t) \le C_1 \operatorname{Diam}(\Theta)^{2(\alpha-1)} W_2^2(G, \tilde{G}_t) \le C_1 \operatorname{Diam}(\Theta)^{2(\alpha-1)} c_0^2 r^2$$
(the second inequality is due to the condition on the likelihood functions); it follows that
$$h(p_{G_0}, p_G) \ge h(p_{G_0}, p_{\tilde{G}_t}) - h(p_G, p_{\tilde{G}_t}) \ge C_k(\mathcal{G}, r)^{1/2} - \big[C_1 \operatorname{Diam}(\Theta)^{2(\alpha-1)} c_0^2 r^2\big]^{1/2}.$$
Choose
$$c_0 = \frac{C_k(\mathcal{G}, r)^{1/2}}{2\sqrt{C_1}\, r \operatorname{Diam}(\Theta)^{\alpha-1}},$$
so that the previous bounds become $h(p_G, p_{\tilde{G}_t}) \le C_k(\mathcal{G}, r)^{1/2}/2 \le h(p_{G_0}, p_{\tilde{G}_t})/2$ and $h(p_{G_0}, p_G) \ge C_k(\mathcal{G}, r)^{1/2}/2$.

For each pair of $G_0, \tilde{G}_t$, there exist tests $\omega_n^{(t)}$ of $p_{G_0}$ versus the closed Hellinger ball $\{p_G : h(p_G, p_{\tilde{G}_t}) \le h(p_{G_0}, p_{\tilde{G}_t})/2\}$ such that
$$P_{G_0}\omega_n^{(t)} \le \exp[-n h^2(P_{G_0}, P_{\tilde{G}_t})/8],$$
$$\sup_{G \in \bar{\mathcal{G}}(\Theta): h(p_G, p_{\tilde{G}_t}) \le h(p_{G_0}, p_{\tilde{G}_t})/2} P_G(1 - \omega_n^{(t)}) \le \exp[-n h^2(P_{G_0}, P_{\tilde{G}_t})/8].$$


Consider the test $\varphi_n = \max_{t \le M} \omega_n^{(t)}$. Then
$$P_{G_0}\varphi_n \le M \exp[-n C_k(\mathcal{G}, r)/8], \qquad \sup_{G \in \mathcal{G} \cap B_W(G_1, r/2)} P_G(1 - \varphi_n) \le \exp[-n C_k(\mathcal{G}, r)/8].$$
The first inequality is due to $\varphi_n \le \sum_{t=1}^M \omega_n^{(t)}$, and the second is due to the fact that for any $G \in \mathcal{G} \cap B_W(G_1, r/2)$ there is some $t \le M$ such that $W_2(G, \tilde{G}_t) \le c_0 r$, so that $h(p_G, p_{\tilde{G}_t}) \le h(p_{G_0}, p_{\tilde{G}_t})/2$.
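Operationally, a maximal packing of this kind can be constructed greedily. The following sketch is ours, not the paper's: `greedy_packing` is an illustrative helper, and in the setting above the metric argument would be a $W_2$ routine (e.g., from an optimal transport library) rather than the toy metric used here.

```python
def greedy_packing(candidates, dist, radius):
    """Greedily select points that are pairwise more than `radius` apart.
    The selection is a maximal radius-packing: by construction, every
    remaining candidate lies within `radius` of some selected center."""
    centers = []
    for g in candidates:
        if all(dist(g, c) > radius for c in centers):
            centers.append(g)
    return centers

# Toy usage on the real line with the absolute-value metric:
pts = [i / 100 for i in range(101)]
centers = greedy_packing(pts, lambda a, b: abs(a - b), 0.25)
print(centers)  # [0.0, 0.26, 0.52, 0.78] -- pairwise gaps exceed 0.25
```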

    Proof of Lemma 3.

Proof. For a given $t \in \mathbb{N}$, choose a maximal $t\epsilon/2$-packing of the set $S_t = \{G : t\epsilon < W_2(G_0, G) \le (t+1)\epsilon\}$. This yields a set $S'_t$ of at most $D(t\epsilon/2, S_t, W_2)$ points. Moreover, every $G \in S_t$ is within distance $t\epsilon/2$ of at least one of the points in $S'_t$. For every such point $G_1 \in S'_t$, there exists a test $\omega_n$ satisfying Eqs. (7) and (8). Take $\varphi_n$ to be the maximum of all tests attached this way to some point $G_1 \in S'_t$ for some $t \ge t_0$. Then, by the union bound and the fact that $D(\epsilon)$ is non-increasing,
$$P_{G_0}\varphi_n \le \sum_{t \ge t_0} \sum_{G_1 \in S'_t} M(\mathcal{G}, G_1, t\epsilon) \exp[-n C_k(\mathcal{G}, t\epsilon)/8] \le D(\epsilon) \sum_{t \ge t_0} \exp[-n C_k(\mathcal{G}, t\epsilon)/8],$$
$$\sup_{G \in \cup_{u \ge t_0} S_u} P_G(1 - \varphi_n) \le \sup_{u \ge t_0} \exp[-n C_k(\mathcal{G}, u\epsilon)/8] \le \exp[-n C_k(\mathcal{G}, t_0\epsilon)/8],$$
where the last inequality is due to the monotonicity of $C_k(\mathcal{G}, \cdot)$.

Proof of Theorems 3 and 4.

Proof. By a result of Ghosal et al. [13] (Lemma 8.1, pg. 524), for every $\epsilon > 0$ and probability measure $\Pi$ on the set $B_K(\epsilon)$ defined by Eq. (12), we have, for every $C > 0$,
$$P_{G_0}\left( \int \prod_{i=1}^n \frac{p_G(X_i)}{p_{G_0}(X_i)}\, d\Pi(G) \le \exp(-(1 + C) n \epsilon^2) \right) \le \frac{1}{C^2 n \epsilon^2}.$$

This entails that, for a fixed $C \ge 1$, there is an event $A_n$ with $P_{G_0}$-probability at least $1 - (C n \epsilon_n^2)^{-1}$, on which there holds
$$\int \prod_{i=1}^n p_G(X_i)/p_{G_0}(X_i)\, d\Pi(G) \ge \exp(-2 n \epsilon_n^2)\, \Pi(B_K(\epsilon_n)). \quad (29)$$
Let $\mathcal{O}_n = \{G \in \bar{\mathcal{G}}(\Theta) : W_2(G_0, G) \ge M_n \epsilon_n\}$ and $S_{n,j} = \{G \in \mathcal{G}_n : W_2(G_0, G) \in [j\epsilon_n, (j+1)\epsilon_n)\}$ for each $j \ge 1$. The conditions specified by Lemma 3 are satisfied by


setting $D(\epsilon) = \exp(n\epsilon_n^2)$ (constant in $\epsilon$). Thus there exist tests $\varphi_n$ for which Eqs. (10) and (11) hold. Then,
$$P_{G_0}\Pi(G \in \mathcal{O}_n | X_1, \ldots, X_n) = P_{G_0}[\varphi_n \Pi(G \in \mathcal{O}_n | X_1, \ldots, X_n)] + P_{G_0}[(1 - \varphi_n) \Pi(G \in \mathcal{O}_n | X_1, \ldots, X_n)]$$
$$\le P_{G_0}[\varphi_n \Pi(G \in \mathcal{O}_n | X_1, \ldots, X_n)] + P_{G_0} I(A_n^c) + P_{G_0}[(1 - \varphi_n) \Pi(G \in \mathcal{O}_n | X_1, \ldots, X_n) I(A_n)].$$
Exploiting Lemma 3, all terms in the preceding display can be shown to vanish as $n \to \infty$. The proof of Theorem 3 then proceeds in a similar way to Theorem 2.1 of [13], while the proof of Theorem 4 is similar to their Theorem 2.4.
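To indicate why the three terms vanish (a sketch we add, following the standard argument of [13]): the first term is bounded by $P_{G_0}\varphi_n$, which tends to zero by Eq. (10); the second term is at most $(C n \epsilon_n^2)^{-1} \to 0$ by the definition of $A_n$; and for the third term, on $A_n$ the posterior denominator is at least $\exp(-2n\epsilon_n^2)\,\Pi(B_K(\epsilon_n))$ by Eq. (29), so by Fubini's theorem
$$P_{G_0}[(1 - \varphi_n)\Pi(G \in \mathcal{O}_n | X_1, \ldots, X_n) I(A_n)] \le \frac{\int_{\mathcal{O}_n} P_G(1 - \varphi_n)\, d\Pi(G)}{\exp(-2n\epsilon_n^2)\, \Pi(B_K(\epsilon_n))} \le \frac{\exp[-n C_k(\mathcal{G}, M_n\epsilon_n)/8]}{\exp(-2n\epsilon_n^2)\, \Pi(B_K(\epsilon_n))},$$
which tends to zero under the rate and prior-mass conditions of the theorems.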

5.3 Proofs of other auxiliary lemmas

    Proof of Lemma 4.

Proof. (a) Suppose that $(\eta_1, \ldots, \eta_T)$ forms an $\epsilon$-covering for $\Theta$ under metric $\rho$, where $T = N(\epsilon, \Theta, \rho)$ denotes the (minimum) covering number. Take any discrete measure $G = \sum_{i=1}^k p_i \delta_{\theta_i}$. For each $\theta_i$ there is an approximating $\theta'_i$ among the $\eta_j$'s such that $\rho(\theta_i, \theta'_i) < \epsilon$. Let $p' = (p'_1, \ldots, p'_k)$ be a $k$-dim vector in the probability simplex that deviates from $p$ by less than $\delta$ in $l_1$ distance: $\|p' - p\|_1 \le \delta$. Define $G' = \sum_{i=1}^k p'_i \delta_{\theta'_i}$. We shall argue that
$$d_\rho(G, G') \le \sum_{i=1}^k (p_i \wedge p'_i)\, \rho(\theta_i, \theta'_i) + \|p - p'\|_1 \operatorname{Diam}(\Theta) \le \epsilon + \delta \operatorname{Diam}(\Theta).$$
(To see this, consider the following coupling between $G$ and $G'$: associate $p_i \wedge p'_i$ probability mass of $\theta_i$ from $G$ with the same probability mass of $\theta'_i$ from $G'$, while the remaining mass from $G$ and $G'$ (of probability $\|p - p'\|_1$) is coupled in an arbitrary way. The right-hand side of the previous display is an upper bound on the expectation of the $\rho$-distance between two random variables distributed according to the described coupling.)

It follows that an $(\epsilon + \delta \operatorname{Diam}(\Theta))$-covering for $\mathcal{G}_k(\Theta)$ can be constructed by combining each element of a $\delta$-covering in the $l_1$ metric of the $(k-1)$-probability simplex with $k$ $\epsilon$-coverings of $\Theta$. Now, the covering number of the $(k-1)$-probability simplex is less than the number of cubes of side length $\delta/k$ covering $[0, 1]^k$ times the volume of $\{(p'_1, \ldots, p'_k) : p'_j \ge 0, \sum_j p'_j \le 1 + \delta\}$, i.e., $(k/\delta)^k (1 + \delta)^k / k! \sim (1 + 1/\delta)^k e^k / \sqrt{2\pi k}$. It follows that
$$N(\epsilon + \delta \operatorname{Diam}(\Theta), \mathcal{G}_k(\Theta), d_\rho) \le T^k (1 + 1/\delta)^k e^k / \sqrt{2\pi k}.$$
Take $\delta = \epsilon / \operatorname{Diam}(\Theta)$ to achieve the claim.
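The asymptotic simplification used above follows from the Stirling lower bound $k! \ge \sqrt{2\pi k}\,(k/e)^k$, a step we spell out:
$$\frac{(k/\delta)^k (1 + \delta)^k}{k!} \le \frac{(k/\delta)^k (1 + \delta)^k e^k}{k^k \sqrt{2\pi k}} = \Big( \frac{1 + \delta}{\delta} \Big)^k \frac{e^k}{\sqrt{2\pi k}} = (1 + 1/\delta)^k \frac{e^k}{\sqrt{2\pi k}}.$$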

(b) As before, let $(\eta_1, \ldots, \eta_T)$ be an $\epsilon$-covering of $\Theta$. Take any $G = \sum_{i=1}^k p_i \delta_{\theta_i} \in \bar{\mathcal{G}}(\Theta)$, where $k$ may be infinite. The collection of atoms $\theta_1, \ldots, \theta_k$ can be subdivided into disjoint subsets $S_1, \ldots, S_T$, some of which may be empty, so that for each $t = 1, \ldots, T$, $\rho(\theta_i, \eta_t) \le \epsilon$ for any $\theta_i \in S_t$. Define $p'_t = \sum_{i=1}^k p_i I(\theta_i \in S_t)$, and let $G' = \sum_{t=1}^T p'_t \delta_{\eta_t}$; then we are guaranteed that $d_\rho(G, G') \le \sum_{i=1}^k \sum_{t=1}^T p_i I(\theta_i \in S_t) \rho(\theta_i, \eta_t) \le \epsilon$, by using a coupling argument similar to that of part (a).


Let $p'' = (p''_1, \ldots, p''_T)$ be a $T$-dim vector in the probability simplex that deviates from $p'$ by less than $\delta$ in $l_1$ distance: $\|p'' - p'\|_1 \le \delta$. Take $G'' = \sum_{t=1}^T p''_t \delta_{\eta_t}$. It is simple to observe that $d_\rho(G', G'') \le \operatorname{Diam}(\Theta)\delta$. By the triangle inequality, $d_\rho(G, G'') \le d_\rho(G, G') + d_\rho(G', G'') \le \epsilon + \delta \operatorname{Diam}(\Theta)$.

The foregoing arguments establish that an $(\epsilon + \delta \operatorname{Diam}(\Theta))$-covering in the Wasserstein metric for $\bar{\mathcal{G}}(\Theta)$ can be constructed by combining each element of the $\delta$-covering in $l_1$ of the $(T-1)$-simplex with a single covering of $\Theta$. From the proof of part (a), $N(\epsilon + \delta \operatorname{Diam}(\Theta), \bar{\mathcal{G}}(\Theta), d_\rho) \le (1 + 1/\delta)^T e^T / \sqrt{2\pi T}$. Take $\delta = \epsilon / \operatorname{Diam}(\Theta)$ to conclude.
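The construction of $G'$ in part (b) is just nearest-point quantization. A minimal sketch (ours; the one-dimensional setting and the helper `quantize` are illustrative assumptions, not the paper's notation):

```python
import numpy as np

def quantize(weights, atoms, net):
    """Project the discrete measure sum_i w_i delta_{x_i} onto the net:
    each atom sends its mass to a nearest net point, so the induced
    coupling transports every unit of mass a distance at most eps."""
    w_net = np.zeros(len(net))
    for w, x in zip(weights, atoms):
        w_net[np.argmin(np.abs(net - x))] += w
    return w_net

net = np.linspace(0.0, 1.0, 11)       # an eps-net of [0, 1] with eps = 0.05
weights = np.array([0.2, 0.5, 0.3])   # G = 0.2 d_{0.12} + 0.5 d_{0.47} + 0.3 d_{0.91}
atoms = np.array([0.12, 0.47, 0.91])
print(quantize(weights, atoms, net))  # mass lands at net points 0.1, 0.5, 0.9
```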

(c) Consider a $G = \sum_{i=1}^k p_i \delta_{\theta_i}$ such that $d_\rho(G_0, G) \le 2\epsilon$. By definition, there is a coupling $q \in Q(p, p^*)$ so that $\sum_{ij} q_{ij} \rho(\theta^*_i, \theta_j) \le 2\epsilon$. Since $\sum_j q_{ij} = p^*_i$, this implies that $2\epsilon \ge \sum_{i=1}^k p^*_i \min_j \rho(\theta^*_i, \theta_j)$. Thus, for each $i = 1, \ldots, k$ there is a $j$ such that $\rho(\theta^*_i, \theta_j) \le 2\epsilon/p^*_i \le 2M\epsilon$. Without loss of generality, assume that $\rho(\theta^*_i, \theta_i) \le 2M\epsilon$ for all $i = 1, \ldots, k$. For sufficiently small $\epsilon$ and any $i$, it is simple to observe that $d_\rho(G_0, G) \ge |p^*_i - p_i| \min_{j \ne i} \rho(\theta^*_i, \theta_j) \ge |p^*_i - p_i| \min_{j \ne i} \rho(\theta^*_i, \theta^*_j)/2$. Thus, $|p^*_i - p_i| \le 4\epsilon/m$.

Now, an $(\epsilon/4 + \delta \operatorname{Diam}(\Theta))$-covering in $d_\rho$ for $\{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}$ can be constructed by combining the $\epsilon/4$-coverings for each of the $k$ sets $\{\theta \in \Theta : \rho(\theta, \theta^*_i) \le 2M\epsilon\}$ with the $\delta/k$-coverings for each of the $k$ intervals $[p^*_i - 4\epsilon/m, p^*_i + 4\epsilon/m]$. This entails that
$$N(\epsilon/4 + \delta \operatorname{Diam}(\Theta), \{G \in \mathcal{G}_k(\Theta) : d_\rho(G_0, G) \le 2\epsilon\}, d_\rho) \le \big[\sup_{\Theta'} N(\epsilon/4, \Theta', \rho)\big]^k (8\epsilon k / (m\delta))^k.$$
Take $\delta = \epsilon / (4 \operatorname{Diam}(\Theta))$ to conclude the proof.

    Proof of Lemma 6.

Proof. (a) For an arbitrary constant $R > 0$, we have
$$\int |p(x) - p'(x)| \|x\|^t dx \le \int_{\|x\| \le R} |p - p'| \|x\|^t + \int_{\|x\| \ge R} (p + p') \|x\|^t \le R^t \|p - p'\|_{L_1} + R^{-(s-t)} (\mathbb{E}_p \|X\|^s + \mathbb{E}_{p'} \|X\|^s).$$
Choose $R = [(\mathbb{E}_p \|X\|^s + \mathbb{E}_{p'} \|X\|^s) / \|p - p'\|_{L_1}]^{1/s}$ to conclude.

(b) For any $R > 0$, we have
$$\int_{\|x\| \le R} |p(x) - p'(x)|\, dx \le V_d^{1/2} R^{d/2} \Big[ \int_{\|x\| \le R} (p(x) - p'(x))^2 dx \Big]^{1/2} \le V_d^{1/2} R^{d/2} \|p - p'\|_{L_2}.$$
We also have $\int_{\|x\| \ge R} |p(x) - p'(x)|\, dx \le \int_{\|x\| \ge R} p(x) + p'(x)\, dx \le R^{-s} (\mathbb{E}_p \|X\|^s + \mathbb{E}_{p'} \|X\|^s)$. Thus,
$$\|p - p'\|_{L_1} \le \inf_{R > 0}\ V_d^{1/2} R^{d/2} \|p - p'\|_{L_2} + R^{-s} (\mathbb{E}_p \|X\|^s + \mathbb{E}_{p'} \|X\|^s),$$
which gives the desired bound.

    References

[1] A. Barron, M. Schervish, and L. Wasserman. The consistency of posterior distributions in nonparametric problems. Ann. Statist., 27:536–561, 1999.

[2] P. Bickel and D. Freedman. Some asymptotic theory for the bootstrap. Ann. Statist., 9(6):1196–1217, 1981.

[3] L. Birgé. Sur un théorème de minimax et son application aux tests. Probab. Math. Statist., 3:259–282, 1984.

[4] R. J. Carroll and P. Hall. Optimal rates of convergence for deconvolving a density. J. Amer. Statist. Assoc., 83:1184–1186, 1988.

[5] J. Chen. Optimal rate of convergence for finite mixture models. Ann. Statist., 23(1):221–233, 1995.

[6] E. del Barrio, J. Cuesta-Albertos, C. Matrán, and J. Rodríguez-Rodríguez. Tests of goodness of fit based on the L2-Wasserstein distance. Ann. Statist., 27(4):1230–1239, 1999.

[7] R. M. Dudley. Probabilities and Metrics: Convergence of Laws on Metric Spaces, with a View to Statistical Testing. Aarhus Universitet, 1976.

[8] J. Fan. On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist., 19(3):1257–1272, 1991.

[9] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Statist., 1:209–230, 1973.

[10] A. E. Gelfand, A. Kottas, and S. N. MacEachern. Bayesian nonparametric spatial modeling with Dirichlet process mixing. J. Amer. Statist. Assoc., 100:1021–1035, 2005.

[11] C. Genovese and L. Wasserman. Rates of convergence for the Gaussian mixture sieve. Ann. Statist., 28:1105–1127, 2000.

[12] S. Ghosal, J. K. Ghosh, and R. V. Ramamoorthi. Posterior consistency of Dirichlet mixtures in density estimation. Ann. Statist., 27:143–158, 1999.

[13] S. Ghosal, J. K. Ghosh, and A. van der Vaart. Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.

[14] S. Ghosal and A. van der Vaart. Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist., 35(1):192–223, 2007.

[15] S. Ghosal and A. van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist., 35:697–723, 2007.

[16] N. Hjort, C. Holmes, P. Mueller, and S. Walker. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 2010.

[17] H. Ishwaran, L. James, and J. Sun. Bayesian model selection in finite mixtures by marginal density decompositions. J. Amer. Statist. Assoc., 96(456):1316–1332, 2001.

[18] H. Ishwaran and M. Zarepour. Dirichlet prior sieves in finite normal mixtures. Statistica Sinica, 12:941–963, 2002.

[19] L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, 1986.

[20] B. Lindsay. Mixture Models: Theory, Geometry and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA, 1995.

[21] C. Mallows. A note on asymptotic joint normality. Ann. Math. Statist., 43:508–515, 1972.

[22] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[23] X. Nguyen. Inference of global clusters from locally distributed data. Bayesian Analysis, 5(4):817–846, 2010.

[24] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[25] S. Petrone, M. Guindani, and A. E. Gelfand. Hybrid Dirichlet mixture models for functional data. Journal of the Royal Statistical Society B, 71(4):755–782, 2009.

[26] A. Rodriguez, D. Dunson, and A. E. Gelfand. The nested Dirichlet process. J. Amer. Statist. Assoc., 103(483):1131–1154, 2008.

[27] X. Shen and L. Wasserman. Rates of convergence of posterior distributions. Ann. Statist., 29:687–714, 2001.

[28] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 101:1566–1581, 2006.

[29] H. Teicher. On the mixture of distributions. Ann. Math. Statist., 31:55–73, 1960.

[30] H. Teicher. Identifiability of mixtures. Ann. Math. Statist., 32:244–248, 1961.

[31] C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.

[32] C. Villani. Optimal Transport: Old and New. Springer, 2008.

[33] S. Walker. New approaches to Bayesian consistency. Ann. Statist., 32(5):2028–2043, 2004.

[34] S. Walker, A. Lijoi, and I. Prünster. On rates of convergence for posterior distributions in infinite-dimensional models. Ann. Statist., 35(2):738–746, 2007.

[35] C. Zhang. Fourier methods for estimating mixing densities and distributions. Ann. Statist., 18(2):806–831, 1990.

