
On two ways to use determinantal point processes for Monte Carlo integration

Guillaume Gautier 1 2, Rémi Bardenet 1, Michal Valko 3 2

Abstract

This paper focuses on Monte Carlo integration with determinantal point processes (DPPs), which enforce negative dependence between quadrature nodes. We survey the properties of two unbiased Monte Carlo estimators of the integral of interest: a direct one proposed by Bardenet & Hardy (2016) and a less obvious 60-year-old estimator by Ermakov & Zolotukhin (1960) that actually also relies on DPPs. We provide an efficient implementation to sample exactly a particular multidimensional DPP called the multivariate Jacobi ensemble. This lets us investigate the behavior of both estimators on toy problems in yet unexplored regimes.

1. Introduction

Numerical integration is a core task of many machine learning applications, including most Bayesian methods (Robert, 2007). Both deterministic and random algorithms have been proposed (Evans & Swartz, 2000). All methods require combining evaluations of the integrand at so-called quadrature nodes to minimize the approximation error.

We are motivated by a stream of work which makes use of prior knowledge on the smoothness of the integrand using kernels and RKHSs. Oates et al. (2017) and Liu & Lee (2017) made use of kernel-based control variates, splitting the computational budget into regressing the integrand and integrating the residual. Bach (2017) looked for the best way to sample i.i.d. nodes and combine the resulting evaluations. Bayesian quadrature (O'Hagan, 1991; Huszár & Duvenaud, 2012; Briol et al., 2015), herding (Chen et al., 2010; Bach et al., 2012), and the biased importance sampling estimate of Delyon & Portier (2016) all favor dissimilar quadrature nodes, where dissimilarity is measured by a kernel. Our work falls within this last cluster.

1 Univ. Lille, CNRS, Centrale Lille, UMR 9189 - CRIStAL, France. 2 INRIA Lille - Nord Europe, France. 3 DeepMind Paris. Correspondence to: Guillaume Gautier <[email protected]>.

Proceedings of the Workshop on Negative Dependence in Machine Learning at the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

We build on the particular approach of Bardenet & Hardy (2016) for Monte Carlo integration, based on projection determinantal point processes (DPPs, Hough et al., 2006; Kulesza & Taskar, 2012). DPPs are a repulsive distribution over configurations of points; repulsiveness is again parametrized by a kernel. In a sense, DPPs are the kernel machines of point processes.

Our contributions. First, we point out a mostly forgotten Monte Carlo estimator derived by Ermakov & Zolotukhin (EZ, 1960) that implicitly but crucially requires sampling from a DPP, more than a decade before Macchi (1975) even formalized DPPs! Second, we provide a simple proof of their result and survey the properties of the estimator with modern arguments. In particular, when the integrand is a linear combination of the eigenfunctions of the underlying DPP kernel, the corresponding Fourier-like coefficients can be estimated with zero variance: from one sample of the corresponding DPP, perfect reconstruction of the signal is granted by solving a linear system. Third, we propose an efficient Python implementation for sampling exactly a particular DPP called the multivariate Jacobi ensemble. This implementation allows us to numerically investigate the behavior of the two Monte Carlo estimators derived by Bardenet & Hardy (2016) and Ermakov & Zolotukhin (1960), in regimes yet unexplored for either of the two. Our point is not to compare DPP-based Monte Carlo estimators to the wide choice of numerical integration algorithms, but to get a fine understanding of their properties so as to fine-tune their design and guide theoretical developments.

2. Quadrature, DPPs, and the multivariate Jacobi ensemble

2.1. Standard quadrature

Let $\mu(dx) = \omega(x)\,dx$ be a positive Borel measure on $\mathbb{X} \subset \mathbb{R}^d$ with finite mass and density $\omega$ w.r.t. the Lebesgue measure. This paper aims to compute integrals of the form $\int f(x)\,\mu(dx)$ for some test function $f : \mathbb{X} \to \mathbb{R}$. A quadrature rule approximates such integrals as a weighted sum of evaluations of $f$ at some points $\{x_1, \dots, x_N\} \subset \mathbb{X}$,

$$\int f(x)\,\mu(dx) \approx \sum_{n=1}^{N} w_n f(x_n), \qquad (1)$$

where the weights $w_n \triangleq w_n(x_1, \dots, x_N) \in \mathbb{R}$ need not sum to one.


Among the many quadrature designs mentioned in the introduction, we pay special attention to the textbook example of the (deterministic) Gauss-Jacobi rule. This scheme applies to $\mathbb{X} \triangleq [-1, 1]$ with $\omega(x) \triangleq (1-x)^a (1+x)^b$, where $a, b > -1$. In this case, the nodes $x_1, \dots, x_N$ are taken to be the zeros of $p_N$, the orthonormal Jacobi polynomial of degree $N$, and the weights are $w_n \triangleq 1/K(x_n, x_n)$ with $K(x,x) \triangleq \sum_{k=0}^{N-1} p_k(x)^2$. In particular, this rule perfectly integrates polynomials up to degree $2N - 1$ (Davis & Rabinowitz, 1984, Section 2.7). In a sense, the DPPs of Bardenet & Hardy (2016) are a random, multivariate generalization of Gauss-Jacobi quadrature, cf. Section 3.1.
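For intuition, here is a minimal sketch of Gauss-Jacobi quadrature using SciPy; the integrand $f(x) = x^2$ and the parameters $a = b = -1/2$ are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of the Gauss-Jacobi rule with SciPy (illustrative choice
# of integrand and Jacobi parameters a = b = -1/2).
import numpy as np
from scipy.special import roots_jacobi

a, b, N = -0.5, -0.5, 10
nodes, weights = roots_jacobi(N, a, b)  # zeros of p_N and Gaussian weights
f = lambda x: x**2
approx = np.sum(weights * f(nodes))     # exact for polynomials of degree <= 2N - 1

# Reference value: int_{-1}^{1} x^2 (1-x)^{-1/2} (1+x)^{-1/2} dx = pi / 2
print(approx, np.pi / 2)
```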

Monte Carlo integration can be defined as making random choices of nodes in (1). Importance sampling corresponds to i.i.d. nodes, while Markov chain Monte Carlo corresponds to nodes drawn from a Markov chain (Robert & Casella, 2004). Finally, quasi-Monte Carlo (Dick & Pillichshammer, 2010) applies to $\mu$ uniform over a compact subset of $\mathbb{R}^d$, and consists in constructing deterministic nodes that spread very uniformly, as measured by the so-called discrepancy.

2.2. Projection DPPs

DPPs can be understood as a parametric class of point processes, specified by a base measure $\mu$ and a kernel $K : \mathbb{X} \times \mathbb{X} \to \mathbb{R}$. In this work, we take $\mathbb{X} = [-1, 1]^d$ and assume $K$ to be continuous and positive semi-definite. For the resulting process to be well defined, it is necessary and sufficient that the kernel $K$ has eigenvalues in $[0, 1]$ (Soshnikov, 2000, Theorem 3). When the eigenvalues further belong to $\{0, 1\}$, we speak of an (orthogonal) projection kernel and a projection DPP. Projection DPPs have the practical feature of producing samples with fixed cardinality, $\mu$-almost surely equal to the rank $N$ of the kernel. More generally, they are the building blocks of DPPs: under some general assumptions, all DPPs are mixtures of projection DPPs (Hough et al., 2006). Hereafter, unless specifically stated, we consider projection DPPs.

One way to define a projection DPP with $N$ points is to first take $N$ orthonormal functions $\phi_0, \dots, \phi_{N-1}$ in $L^2(\mu)$, i.e., $\langle \phi_k, \phi_\ell \rangle_{L^2(\mu)} \triangleq \int \phi_k(x) \phi_\ell(x)\,\mu(dx) = \delta_{k\ell}$. Then, consider the kernel $K_N$ associated to the orthogonal projector onto $\mathcal{H}_N \triangleq \mathrm{span}\{\phi_k,\ 0 \leq k \leq N-1\}$, i.e.,

$$K_N(x, y) \triangleq \sum_{k=0}^{N-1} \phi_k(x) \phi_k(y). \qquad (2)$$

The subset $\{\mathbf{x}_1, \dots, \mathbf{x}_N\} \subset \mathbb{X}$ is said to be drawn from the projection DPP with base measure $\mu$ and kernel $K_N$, denoted by $\mathbf{x}_1, \dots, \mathbf{x}_N \sim \mathrm{DPP}(\mu, K_N)$, when $(\mathbf{x}_1, \dots, \mathbf{x}_N)$ has joint distribution

$$\frac{1}{N!} \det\big[K_N(x_p, x_q)\big]_{p,q=1}^{N}\ \mu^{\otimes N}(dx). \qquad (3)$$

$\mathrm{DPP}(\mu, K_N)$ defines a probability measure over sets: (3) is invariant by permutation, and the orthonormality of the $\phi_k$'s yields the normalization; see Appendix A.1 for details.
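As a sanity check, here is a minimal sketch of (2) and (3), assuming orthonormalized Legendre polynomials on $[-1,1]$ as our illustrative choice of the $\phi_k$'s (base measure the Lebesgue measure), not the paper's Jacobi setting.

```python
# A minimal sketch of the projection kernel (2) and joint density (3),
# with orthonormalized Legendre polynomials as an assumed choice of phi_k.
import numpy as np
from math import factorial
from scipy.special import eval_legendre

N = 5

def phi(k, x):
    # Orthonormal Legendre polynomial w.r.t. the Lebesgue measure on [-1, 1]
    return np.sqrt((2 * k + 1) / 2) * eval_legendre(k, x)

def K_N(x, y):
    # Projection kernel (2): sum over k of phi_k(x) * phi_k(y)
    return sum(phi(k, x) * phi(k, y) for k in range(N))

def joint_density(points):
    # Joint density (3) w.r.t. dx_1 ... dx_N: det[K_N(x_p, x_q)] / N!
    G = np.array([[K_N(xp, xq) for xq in points] for xp in points])
    return np.linalg.det(G) / factorial(N)
```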

The repulsiveness of projection DPPs may be understood geometrically by rewriting (3) as

$$\prod_{n=1}^{N} \frac{1}{N - (n-1)}\, \big\| \Pi_{n-1}^{\perp} K_N(\mathbf{x}_n, \cdot) \big\|_{L^2(\mu)}^2\ \mu(d\mathbf{x}_n), \qquad (4)$$

where $\Pi_{n-1}^{\perp}$ is the orthogonal projector onto the orthocomplement of $\mathrm{span}\{K_N(\mathbf{x}_\ell, \cdot)\}_{\ell=1}^{n-1}$ in $\mathcal{H}_N$. Seeing (4) as a base$\times$height formula, the joint distribution (3) is proportional to the squared volume of the parallelotope spanned by $K(\mathbf{x}_1, \cdot), \dots, K(\mathbf{x}_N, \cdot)$ in the feature space $\mathcal{H}_N$. Hence, the larger the volume, the more likely $\mathbf{x}_1, \dots, \mathbf{x}_N$ co-occur.

Moreover, using the same normal equations as in standard linear regression, the norms in (4) read

$$\big\| \Pi_{n-1}^{\perp} K_N(\mathbf{x}_n, \cdot) \big\|_{L^2(\mu)}^2 = \begin{cases} K_N(x_1, x_1), & \text{if } n = 1, \\ K_N(x_n, x_n) - \mathbf{K}_{n-1}(x_n)^{\mathsf{T}} \mathbf{K}_{n-1}^{-1} \mathbf{K}_{n-1}(x_n), & \text{otherwise}, \end{cases} \qquad (5)$$

where $\mathbf{K}_{n-1}(\cdot) = \big(K_N(x_1, \cdot), \dots, K_N(x_{n-1}, \cdot)\big)^{\mathsf{T}}$ and $\mathbf{K}_{n-1} = \big[K_N(x_p, x_q)\big]_{p,q=1}^{n-1}$.

The unnormalized conditional densities (5) also show up in Gaussian processes (GPs, Rasmussen & Williams, 2006) as the incremental posterior variances in a GP model with kernel $K_N$, giving yet another intuition for repulsiveness.
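A minimal sketch of the Schur complement in (5), reusing the illustrative `K_N` function from the previous snippet:

```python
# A minimal sketch of the unnormalized conditionals (5); xs holds the
# previously sampled points x_1, ..., x_{n-1}.
import numpy as np

def conditional_unnormalized(x, xs, K_N):
    # Schur complement K_N(x, x) - K_{n-1}(x)^T K_{n-1}^{-1} K_{n-1}(x),
    # i.e., the incremental posterior variance of a GP with kernel K_N.
    if len(xs) == 0:
        return K_N(x, x)
    K = np.array([[K_N(xp, xq) for xq in xs] for xp in xs])
    k = np.array([K_N(xp, x) for xp in xs])
    return K_N(x, x) - k @ np.linalg.solve(K, k)
```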

2.3. The multivariate Jacobi ensemble

We follow Bardenet & Hardy (2016) and take the eigenfunctions of the kernel in (2) to be the orthonormal polynomials w.r.t. $\mu$. In dimension $d = 1$, the resulting projection DPP is called an orthogonal polynomial ensemble (OPE, König, 2004). When $d > 1$, orthonormal polynomials can still be uniquely defined by applying the Gram-Schmidt procedure to a set of monomials. However, there is no natural order on multivariate monomials: an ordering $b : \mathbb{N}^d \to \mathbb{N}$ must be picked before we apply Gram-Schmidt to the monomials in $L^2(\mu)$. Bardenet & Hardy (2016, Section 2.1.3) consider multi-indices $k \triangleq (k_1, \dots, k_d) \in \mathbb{N}^d$ ordered by their maximum degree $\max_i k_i$ and, for constant maximum degree, by the usual lexicographic order; a minimal sketch of this ordering follows. We still denote the multivariate orthonormal polynomials by $(\phi_k)_{k \in \mathbb{N}^d}$.
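```python
# A minimal sketch of the ordering b: multi-indices sorted by maximum
# degree, then lexicographically (our reading, not the authors' code).
from itertools import product

def ordered_multi_indices(N, d, max_deg):
    ks = sorted(product(range(max_deg + 1), repeat=d),
                key=lambda k: (max(k), k))
    return ks[:N]  # the multi-indices k with b(k) = 0, ..., N-1

print(ordered_multi_indices(6, 2, 2))
# [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (1, 2)]
```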

By multivariate OPE we mean the projection DPP with base measure $\mu(dx) \triangleq \omega(x)\,dx$ and orthogonal projection kernel $K_N(x, y) \triangleq \sum_{b(k)=0}^{N-1} \phi_k(x) \phi_k(y)$. When the base measure is separable, i.e., $\omega(x) = \omega_1(x^1) \times \cdots \times \omega_d(x^d)$, multivariate orthonormal polynomials are products of univariate ones, and the kernel (2) reads

$$K_N(x, y) = \sum_{b(k)=0}^{N-1} \prod_{i=1}^{d} \phi^i_{k_i}(x^i)\, \phi^i_{k_i}(y^i), \qquad (6)$$

with $(\phi^i_\ell)_{\ell \geq 0}$ the orthonormal polynomials w.r.t. $\omega_i(x^i)\,dx^i$. For $\mathbb{X} = [-1, 1]^d$ and $\omega_i(x^i) = (1-x^i)^{a_i}(1+x^i)^{b_i}$, with $a_i, b_i > -1$, the resulting DPP is called a multivariate Jacobi ensemble.
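Under the separability assumption, (6) can be coded directly. A minimal sketch, where `phi_uni(i, k, x)` is a hypothetical helper evaluating the $k$-th orthonormal polynomial w.r.t. $\omega_i$, and `multi_indices` comes from the ordering snippet above:

```python
# A minimal sketch of the separable kernel (6); phi_uni is a hypothetical
# evaluator of the univariate orthonormal polynomials.
import numpy as np

def K_N_multi(x, y, multi_indices, phi_uni):
    # x, y are length-d arrays; multi_indices lists the k with b(k) < N.
    return sum(
        np.prod([phi_uni(i, k[i], x[i]) * phi_uni(i, k[i], y[i])
                 for i in range(len(x))])
        for k in multi_indices)
```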

3. Monte Carlo with projection DPPs

3.1. A natural estimator

Bardenet & Hardy (2016) used

$$\widehat{I}^{\mathrm{BH}}_N(f) \triangleq \sum_{n=1}^{N} \frac{f(\mathbf{x}_n)}{K_N(\mathbf{x}_n, \mathbf{x}_n)}, \quad f \in L^1(\mu), \qquad (7)$$

as an unbiased estimator of $\int f(x)\,\mu(dx)$, with variance

$$\frac{1}{2} \iint \left( \frac{f(x)}{K_N(x, x)} - \frac{f(y)}{K_N(y, y)} \right)^2 |K_N(x, y)|^2\ \mu(dx)\,\mu(dy).$$

This variance clearly captures a notion of smoothness of $f$ w.r.t. the kernel, but its interpretation is not obvious.
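Given any sampler for $\mathrm{DPP}(\mu, K_N)$, (7) is one line of code. A minimal sketch, where `sample_dpp` and `K_N_diag` are hypothetical placeholders for an actual sampler and kernel evaluation:

```python
# A minimal sketch of the BH estimator (7); sample_dpp() is assumed to
# return an (N, d) array drawn from DPP(mu, K_N), and K_N_diag(X) the
# values K_N(x_n, x_n) along the rows of X. Both are placeholders.
import numpy as np

def I_BH(f, sample_dpp, K_N_diag):
    X = sample_dpp()                   # x_1, ..., x_N ~ DPP(mu, K_N)
    return np.sum(f(X) / K_N_diag(X))  # unbiased for int f dmu
```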

For $\mathbb{X} = [-1, 1]^d$, the interest in the multivariate Jacobi ensemble among DPPs comes from the fact that (7) can be understood as a (randomized) multivariate counterpart of the Gauss-Jacobi quadrature in Section 2.1. Besides, for $f$ essentially $C^1$, Bardenet & Hardy (2016, Theorem 2.7) also proved a CLT with faster-than-classical-Monte-Carlo decay,

$$\sqrt{N^{1+1/d}} \left( \widehat{I}^{\mathrm{BH}}_N(f) - \int f(x)\,\mu(dx) \right) \xrightarrow[N \to \infty]{\mathrm{law}} \mathcal{N}\big(0, \Omega^2_{f,\omega}\big), \qquad (8)$$

with $\Omega^2_{f,\omega} \triangleq \frac{1}{2} \sum_{k \in \mathbb{N}^d} (k_1 + \cdots + k_d)\, \mathcal{F}\!\big[f \tfrac{\omega}{\omega_{\mathrm{eq}}}\big](k)^2$, where $\mathcal{F}g$ denotes the Fourier transform of $g$, and $\omega_{\mathrm{eq}}(x) \triangleq 1 / \prod_{i=1}^{d} \pi \sqrt{1 - (x^i)^2}$. In the fast CLT (8), the asymptotic variance is governed by the smoothness of $f$, since $\Omega_{f,\omega}$ is a measure of the decay of its Fourier coefficients.

3.2. The Ermakov-Zolotukhin estimator

We first state the main result of Ermakov & Zolotukhin (1960), see also Evans & Swartz (2000, Section 6.4.3) and references therein. Using modern arguments and notation, we can provide a short and simple proof of this result, cf. Appendix A.2. It is based on a generalization of the Cauchy-Binet formula established by Johansson (2006), see also Appendix A.1. We apply the result of Ermakov & Zolotukhin (1960) to build an unbiased estimator of $\int f(x)\,\mu(dx)$ which comes with a practical variance.

Theorem 1. Let $\mathbf{x}_1, \dots, \mathbf{x}_N \sim \mathrm{DPP}(\mu, K_N)$ as in (3). Then, the solution of

$$\begin{pmatrix} \phi_0(x_1) & \dots & \phi_{N-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \dots & \phi_{N-1}(x_N) \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix} \qquad (9)$$

is unique, $\mu$-almost surely, with coordinates $y_k = \det \Phi^{\phi_{k-1}, f} / \det \Phi$ satisfying

$$\mathbb{E}\big[y_k\big] = \langle f, \phi_{k-1} \rangle_{L^2(\mu)}, \qquad (10)$$

$$\mathrm{Var}\big[y_k\big] = \|f\|^2_{L^2(\mu)} - \sum_{\ell=0}^{N-1} \langle f, \phi_\ell \rangle^2_{L^2(\mu)}, \qquad (11)$$

where $\Phi$ denotes the feature matrix in (9) and $\Phi^{\phi_{k-1}, f}$ is defined as the matrix obtained by replacing the $k$-th column of $\Phi$ by $(f(x_1), \dots, f(x_N))^{\mathsf{T}}$.

Several remarks are in order. The latter theorem shows that solving the (random) linear system (9) provides unbiased estimates of the $N$ Fourier-like coefficients $(\langle f, \phi_k \rangle)_{k=0}^{N-1}$. Remarkably, these estimates have the exact same variance (11), equal to the residual $\sum_{k=N}^{\infty} \langle f, \phi_k \rangle^2$. The faster the decay of the coefficients, the smaller the variance. When $f \in \mathcal{H}_N$, these estimators have zero variance: $f$ can be reconstructed perfectly from only one sample of $\mathrm{DPP}(\mu, K_N)$.
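In code, this reconstruction is a single solve. A minimal sketch, where `X` is an $(N, d)$ DPP sample and `Phi` is a hypothetical builder of the $N \times N$ feature matrix in (9):

```python
# A minimal sketch of the zero-variance reconstruction: if f is a linear
# combination of phi_k with b(k) < N, solving (9) returns its coefficients
# exactly from a single DPP sample. Phi is a hypothetical placeholder.
import numpy as np

def reconstruct_coefficients(X, Phi, f):
    # Solution of (9); equals (c_0, ..., c_{N-1}) when f lies in H_N
    return np.linalg.solve(Phi(X), f(X))
```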

In the setting of the multivariate Jacobi ensemble described in Section 2.3, the first orthonormal polynomial $\phi_0$ is constant. Hence, a direct application of Theorem 1 yields

$$\widehat{I}^{\mathrm{EZ}}_N(f) \triangleq \mu\big([-1,1]^d\big)^{\frac{1}{2}}\, \frac{\det \Phi^{\phi_0, f}(x_{1:N})}{\det \Phi(x_{1:N})} \qquad (12)$$

as an unbiased estimator of $\int f(x)\,\mu(dx)$, which can be viewed as a quadrature rule, cf. Appendix A.3. Unlike the variance of $\widehat{I}^{\mathrm{BH}}_N(f)$ in Section 3.1, the variance of $\widehat{I}^{\mathrm{EZ}}_N(f)$ clearly reflects the accuracy of the approximation of $f$ by its projection onto $\mathcal{H}_N$. In particular, it integrates and interpolates polynomials up to "degree" $b^{-1}(N-1)$ perfectly. Nonetheless, its limiting theoretical properties, like a CLT, look hard to establish. In particular, the dependence of each quadrature weight on all quadrature nodes makes the estimator a peculiar object that does not fit the assumptions of traditional CLTs for DPPs (Soshnikov, 2000).
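By Cramer's rule, (12) is $\sqrt{\mu([-1,1]^d)}$ times the first coordinate of the solution of (9), so no determinant needs to be formed explicitly. A minimal sketch, reusing the hypothetical `Phi` above and writing `mu_mass` for $\mu([-1,1]^d)$:

```python
# A minimal sketch of the EZ estimator (12) via the linear system (9).
import numpy as np

def I_EZ(f, X, Phi, mu_mass):
    y = np.linalg.solve(Phi(X), f(X))  # Fourier-like coefficient estimates
    # y[0] estimates <f, phi_0>, and phi_0 = 1 / sqrt(mu_mass)
    return np.sqrt(mu_mass) * y[0]
```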

3.3. Sampling

To perform Monte Carlo integration with DPPs, it is crucial to sample the points and evaluate the weights efficiently. Except for some specific instances, exact sampling from continuous projection DPPs requires the spectral decomposition of the kernel (2) before applying the chain rule (4) (Hough et al., 2006). The main challenge is to find good proposal distributions to efficiently sample the successive conditionals (Lavancier et al., 2012).


We focus on sampling the multivariate Jacobi ensemble with parameters $a_i, b_i \in [-\frac{1}{2}, \frac{1}{2}]$. In dimension $d = 1$, it can be sampled at cost $\mathcal{O}(N^2)$ by computing the eigenvalues of a random tridiagonal matrix (Killip & Nenciu, 2004, Theorem 2, $\beta = 2$). For $d \geq 2$, we follow Bardenet & Hardy (2016) and use the same proposal distribution $\omega_{\mathrm{eq}}(x)\,dx$ and rejection bound to sample each conditional. The rejection constant is derived from a result of Chow et al. (1994) on Jacobi polynomials. See Appendix A.4 for more details.

We remodeled the original implementation of Bardenet & Hardy (2016), available at https://github.com/rbardenet/dppmc, in a more Pythonic way. Notably, when evaluating the kernel (6), we paid special attention to avoiding unnecessary evaluations of the univariate orthogonal Jacobi polynomials by propagating the three-term recurrence relations they satisfy. Previously, sampling $N = 100$ points in dimension $d = 1, 2$ took minutes; now it takes milliseconds. In Appendix A.4, we display a 2D sample of size $N = 1000$, obtained in approximately 7 minutes on a modern laptop, compared to hours previously.
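For reference, orthonormal Jacobi polynomials can also be evaluated with SciPy rather than hand-propagated recurrences. A minimal sketch, assuming the standard normalization constant $h_k$ of the Jacobi polynomial $P_k^{(a,b)}$ (our derivation, not the paper's code):

```python
# A minimal sketch of orthonormal Jacobi polynomials via SciPy, dividing
# P_k^{(a,b)} by the square root of its standard squared L2 norm h_k
# w.r.t. (1-x)^a (1+x)^b dx on [-1, 1].
import numpy as np
from scipy.special import eval_jacobi, betaln, gammaln

def phi_jacobi(k, a, b, x):
    if k == 0:
        # h_0 = 2^{a+b+1} B(a+1, b+1), valid for all a, b > -1
        log_h = (a + b + 1) * np.log(2) + betaln(a + 1, b + 1)
    else:
        log_h = ((a + b + 1) * np.log(2) - np.log(2 * k + a + b + 1)
                 + gammaln(k + a + 1) + gammaln(k + b + 1)
                 - gammaln(k + a + b + 1) - gammaln(k + 1))
    return eval_jacobi(k, a, b, x) / np.exp(0.5 * log_h)
```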

4. Empirical investigation

Appendix B collects the results of the following experiments, as well as further experiments on non-smooth functions.

4.1. The bump experiment

Bardenet & Hardy (2016, Section 3) illustrate the behavior of $\widehat{I}^{\mathrm{BH}}_N$ and its CLT (8) on a unimodal, smooth bump function ($\varepsilon = 0.05$). The expected variance decay is of order $1/N^{1+1/d}$. We successfully reproduce their experiment in Figure 1 for larger $N$, and compare with the behavior of $\widehat{I}^{\mathrm{EZ}}_N$. In short, $\widehat{I}^{\mathrm{EZ}}_N$ dramatically outperforms $\widehat{I}^{\mathrm{BH}}_N$ in $d \leq 2$, with surprisingly fast empirical convergence rates. When $d \geq 3$, performance decreases, and $\widehat{I}^{\mathrm{BH}}_N$ shows both faster and more regular variance decay.

As to whether a CLT for $\widehat{I}^{\mathrm{EZ}}_N$ could hold, we performed Kolmogorov-Smirnov tests for $N = 300$, see Appendix B.1. This yielded small $p$-values across dimensions, from 0.03 to 0.24, to be compared with the same $p$-values for $\widehat{I}^{\mathrm{BH}}_N$, which range from 0.60 to 0.99. The lack of normality of the EZ estimator is partly due to a few outliers. Where these outliers come from is left for future work; ill-conditioning of the linear system (9) is an obvious candidate.

4.2. Integrating sums of eigenfunctions

To test the variance decay of $\widehat{I}^{\mathrm{EZ}}_N(f)$ prescribed by Theorem 1, we consider functions of the form

$$f(x) = \sum_{b(k)=0}^{N_{\mathrm{modes}}-1} \frac{1}{b(k) + 1}\, \phi_k(x). \qquad (13)$$

That is to say, the function $f$ can be either fully ($N_{\mathrm{modes}} \leq N$) or partially ($N_{\mathrm{modes}} > N$) decomposed in the eigenbasis of the kernel. In both cases, we let $N$ vary from 10 to 100 and the dimension $d$ from 1 to 4.

In the first setting, we set $N_{\mathrm{modes}} = 70$. Thus, $N$ eventually reaches the number of functions used to build $f$ in (13), after which $\widehat{I}^{\mathrm{EZ}}_N$ is an exact estimator in any dimension, see Figure 1. The second setting has $N_{\mathrm{modes}} = N + 1$, so that the number of points $N$ is never enough for the variance (11) to be zero. The corresponding $1/N^2$ variance decay prescribed by Theorem 1 can be observed in Appendix B.2.

5. Conclusion

Ermakov & Zolotukhin (EZ, 1960) proposed a non-obvious unbiased Monte Carlo estimator using projection DPPs. It requires solving a linear system, which in turn involves evaluating both the $N$ eigenfunctions of the corresponding kernel and the integrand at the $N$ points of the DPP sample. This is yet another connection between DPPs and linear algebra. In fact, solving this linear system provides unbiased estimates of the Fourier-like coefficients of the integrand $f$ onto each of the $N$ eigenfunctions of the DPP kernel. Remarkably, these estimators have identical variance, measuring the accuracy of the approximation of $f$ by its projection onto these eigenfunctions. With modern arguments, we have provided a much shorter proof of these properties than in the original work of Ermakov & Zolotukhin (1960). Beyond this, little is known on the EZ estimator. While coming with a less interpretable variance, the more direct estimator proposed by Bardenet & Hardy (BH, 2016) has an intrinsic connection with classical Gauss quadrature and further enjoys stronger theoretical properties when using the multivariate Jacobi ensemble.

Our experiments highlight the key features of both estimators when the underlying DPP is a multivariate Jacobi ensemble, and further demonstrate the known properties of the BH estimator in yet unexplored regimes. Although EZ shows a surprisingly fast empirical convergence rate for $d \leq 2$, its behavior is more erratic for $d \geq 3$. Ill-conditioning of the linear system is a potential source of outliers in the distribution of the estimator. Regularization may help, but would introduce a stability/bias trade-off. More generally, EZ seems worth investigating for integration or even interpolation tasks where the function is known to be decomposable in the eigenbasis of the kernel, i.e., in a setting similar to that of Bach (2017). Finally, the new implementation of an exact sampler for the multivariate Jacobi ensemble unlocks more large-scale empirical investigations and calls for more theoretical work. The associated code is available in the DPPy toolbox of Gautier et al. (2018), at https://github.com/guilgautier/DPPy/notebooks.


Figure 1. Log-log plots of the sample variance of both estimators vs. $N$. (a)-(d) The bump experiment, cf. Section 4.1, for $d = 1$ to 4; the numbers in the legend are the slope and the $R^2$ (e.g., BH $-2.0$, $0.99$ and EZ $-15.4$, $0.97$ for $d = 1$). (e)-(h) The sums of eigenfunctions, cf. Section 4.2, for $d = 1$ to 4.

Acknowledgments

Many thanks to Mathieu Gerber of Univ. Bristol, UK, for digging up this result of Ermakov & Zolotukhin (1960) from his human memory. We acknowledge funding by the European CHIST-ERA project DELTA, the French Ministry of Higher Education and Research, the Nord-Pas-de-Calais Regional Council, the Inria and Otto-von-Guericke-Universität Magdeburg associated team north-European project Allocate, and the French National Research Agency projects ExTra-Learn (n.ANR-14-CE24-0010-01) and BoB (n.ANR-16-CE23-0003).

References

Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017. arXiv:1502.06800.

Bach, F., Lacoste-Julien, S., and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In International Conference on Machine Learning (ICML), 2012. arXiv:1203.4523.

Bardenet, R. and Hardy, A. Monte Carlo with determinantal point processes. ArXiv e-prints, 2016. arXiv:1605.00361.

Briol, F.-X., Oates, C. J., Girolami, M., and Osborne, M. A. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1162–1170, 2015. arXiv:1506.02681.

Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence (UAI), pp. 109–116. AUAI Press, 2010. arXiv:1203.3472.

Chow, Y., Gatteschi, L., and Wong, R. A Bernstein-type inequality for the Jacobi polynomial. Proceedings of the American Mathematical Society, 121(3):703–709, 1994.

Davis, P. J. and Rabinowitz, P. Methods of Numerical Integration. Academic Press, 1984.

Delyon, B. and Portier, F. Integral approximation by kernel smoothing. Bernoulli, 22(4):2177–2208, 2016. arXiv:1409.0733.

Dereziński, M. Fast determinantal point processes via distortion-free intermediate sampling. ArXiv e-prints, 2019. arXiv:1811.03717v2.

Dick, J. and Pillichshammer, F. Digital Nets and Sequences: Discrepancy and Quasi-Monte Carlo Integration. Cambridge University Press, 2010.

Ermakov, S. M. and Zolotukhin, V. G. Polynomial approximations and the Monte-Carlo method. Theory of Probability & Its Applications, 5(4):428–431, 1960.

Evans, M. and Swartz, T. Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford University Press, 2000.

Gautier, G., Bardenet, R., and Valko, M. DPPy: Sampling determinantal point processes with Python. ArXiv e-prints, 2018. arXiv:1809.07258.

Gautschi, W. How sharp is Bernstein's inequality for Jacobi polynomials? Electronic Transactions on Numerical Analysis, 36:1–8, 2009.

Hough, J. B., Krishnapur, M., Peres, Y., and Virág, B. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006. arXiv:math/0503110.

Huszár, F. and Duvenaud, D. Optimally-weighted herding is Bayesian quadrature. In Conference on Uncertainty in Artificial Intelligence (UAI), pp. 377–386. AUAI Press, 2012. arXiv:1204.1664.

Johansson, K. Random matrices and determinantal processes. Les Houches Summer School Proceedings, 83(C):1–56, 2006.

Killip, R. and Nenciu, I. Matrix models for circular ensembles. International Mathematics Research Notices, 2004(50):2665–2701, 2004. arXiv:math/0410034.

König, W. Orthogonal polynomial ensembles in probability theory. Probability Surveys, 2:385–447, 2004. arXiv:math/0403090.

Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012. arXiv:1207.6083.

Launay, C., Galerne, B., and Desolneux, A. Exact sampling of determinantal point processes without eigendecomposition. ArXiv e-prints, 2018. arXiv:1802.08429.

Lavancier, F., Møller, J., and Rubak, E. Determinantal point process models and statistical inference: Extended version. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 77(4):853–877, 2012. arXiv:1205.4818.

Liu, Q. and Lee, J. D. Black-box importance sampling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017. arXiv:1610.05247.

Macchi, O. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.

Oates, C. J., Girolami, M., and Chopin, N. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017. arXiv:1410.2392.

O'Hagan, A. Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.

Poulson, J. High-performance sampling of generic determinantal point processes. ArXiv e-prints, 2019. arXiv:1905.00165.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.

Robert, C. P. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer, 2007.

Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. Springer-Verlag New York, 2004.

Soshnikov, A. Determinantal random point fields. Russian Mathematical Surveys, 55(5):923–975, 2000. arXiv:math/0002099.


A. Methodology

A.1. The generalized Cauchy-Binet formula: the modern argument

Johansson (2006, Section 2.2) developed a natural way to build DPPs associated to projection (potentially non-Hermitian) kernels. In this part, we focus on the generalization of the Cauchy-Binet formula (Johansson, 2006, Proposition 2.10). Its usefulness is twofold for our purpose. First, it serves to justify the fact that the normalization constant of the joint distribution (3) is one, i.e., that it is indeed a probability distribution. Second, we use it as a modern and simple argument to prove the result of Ermakov & Zolotukhin (1960), cf. Theorem 1. An extended version of the proof is given in Appendix A.2.

Lemma A. (Johansson, 2006, Proposition 2.10) Let $(\mathbb{X}, \mathcal{B}, \mu)$ be a measurable space and consider measurable functions $\phi_0, \dots, \phi_{N-1}$ and $\psi_0, \dots, \psi_{N-1}$, such that $\phi_k \psi_\ell \in L^1(\mu)$. Then,

$$\det \Big( \langle \phi_k, \psi_\ell \rangle_{L^2(\mu)} \Big)_{k,\ell=0}^{N-1} = \frac{1}{N!} \int \det \Phi(x_{1:N})\, \det \Psi(x_{1:N})\ \mu^{\otimes N}(dx), \qquad \text{(A.1)}$$

where

$$\Phi(x_{1:N}) = \begin{pmatrix} \phi_0(x_1) & \dots & \phi_{N-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \dots & \phi_{N-1}(x_N) \end{pmatrix} \quad \text{and} \quad \Psi(x_{1:N}) = \begin{pmatrix} \psi_0(x_1) & \dots & \psi_{N-1}(x_1) \\ \vdots & & \vdots \\ \psi_0(x_N) & \dots & \psi_{N-1}(x_N) \end{pmatrix}.$$

A.2. Proof of Theorem 1

First, we recall the result of Ermakov & Zolotukhin (1960), cf. Theorem 1. Then, we provide a modern proof based on the generalization of the Cauchy-Binet formula, cf. Lemma A, where we exploit the orthonormality of the eigenfunctions of the kernel.

Theorem B. Consider $f \in L^2(\mu)$ together with $N$ orthonormal functions $\phi_0, \dots, \phi_{N-1} \in L^2(\mu)$:

$$\langle \phi_k, \phi_\ell \rangle_{L^2(\mu)} \triangleq \int \phi_k(x) \phi_\ell(x)\,\mu(dx) = \delta_{k\ell}, \quad \forall\, 0 \leq k, \ell \leq N - 1. \qquad \text{(A.2)}$$

Let $\mathbf{x}_1, \dots, \mathbf{x}_N \sim \mathrm{DPP}(\mu, K_N)$ with $K_N(x, y) = \sum_{k=0}^{N-1} \phi_k(x) \phi_k(y)$. That is to say, $(\mathbf{x}_1, \dots, \mathbf{x}_N)$ has joint distribution

$$\frac{1}{N!} \det\big[K_N(x_p, x_q)\big]_{p,q=1}^{N}\ \mu^{\otimes N}(dx). \qquad \text{(A.3)}$$

Then, the solution of

$$\begin{pmatrix} \phi_0(x_1) & \dots & \phi_{N-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \dots & \phi_{N-1}(x_N) \end{pmatrix} \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} f(x_1) \\ \vdots \\ f(x_N) \end{pmatrix} \qquad \text{(A.4)}$$

is unique, $\mu$-almost surely, and the coordinates of the solution vector, namely

$$y_k = \frac{\det \Phi^{\phi_{k-1}, f}(x_{1:N})}{\det \Phi(x_{1:N})}, \qquad \text{(A.5)}$$

satisfy

$$\mathbb{E}\big[y_k\big] = \langle f, \phi_{k-1} \rangle_{L^2(\mu)}, \quad \text{and} \quad \mathrm{Var}\big[y_k\big] = \|f\|^2_{L^2(\mu)} - \sum_{\ell=0}^{N-1} \langle f, \phi_\ell \rangle^2_{L^2(\mu)}, \qquad \text{(A.6)}$$

where $\Phi(x_{1:N})$ denotes the feature matrix in (A.4) and $\Phi^{\phi_{k-1}, f}(x_{1:N})$ is defined as the matrix obtained by replacing the $k$-th column of $\Phi(x_{1:N})$ by $(f(x_1), \dots, f(x_N))^{\mathsf{T}}$.

Proof of Theorem B. First, the joint distribution (A.3) of $(\mathbf{x}_1, \dots, \mathbf{x}_N)$ is proportional to $\big(\det \Phi(x_{1:N})\big)^2\ \mu^{\otimes N}(dx)$. Thus, $\det \Phi(x_{1:N}) \neq 0$, $\mu$-almost surely. Hence, the matrix $\Phi(x_{1:N})$ defining the linear system (A.4) is invertible, $\mu$-almost surely. The expression of the coordinates (A.5) follows from Cramer's rule.


Then, we treat the case $k = 1$; the others follow the same lines. The proof relies on Lemma A, where we exploit the orthonormality of the $\phi_k$'s (A.2). The expectation in (A.6) reads

$$\mathbb{E}\left[ \frac{\det \Phi^{\phi_0, f}(x_{1:N})}{\det \Phi(x_{1:N})} \right] \overset{\text{(A.3)}}{=} \frac{1}{N!} \int \det \Phi^{\phi_0, f}(x_{1:N})\, \det \Phi(x_{1:N})\ \mu^{\otimes N}(dx) \overset{\text{(A.1)}}{=} \det \begin{pmatrix} \langle f, \phi_0 \rangle_{L^2(\mu)} & \big( \langle f, \phi_\ell \rangle_{L^2(\mu)} \big)_{\ell=1}^{N-1} \\ \big( \langle \phi_k, \phi_0 \rangle_{L^2(\mu)} \big)_{k=1}^{N-1} & \big( \langle \phi_k, \phi_\ell \rangle_{L^2(\mu)} \big)_{k,\ell=1}^{N-1} \end{pmatrix} \overset{\text{(A.2)}}{=} \det \begin{pmatrix} \langle f, \phi_0 \rangle_{L^2(\mu)} & \big( \langle f, \phi_\ell \rangle_{L^2(\mu)} \big)_{\ell=1}^{N-1} \\ 0_{N-1,1} & I_{N-1} \end{pmatrix} = \langle f, \phi_0 \rangle_{L^2(\mu)}. \qquad \text{(A.7)}$$

Similarly, the second moment reads

$$\mathbb{E}\left[ \left( \frac{\det \Phi^{\phi_0, f}(x_{1:N})}{\det \Phi(x_{1:N})} \right)^2 \right] \overset{\text{(A.3)}}{=} \frac{1}{N!} \int \det \Phi^{\phi_0, f}(x_{1:N})\, \det \Phi^{\phi_0, f}(x_{1:N})\ \mu^{\otimes N}(dx) \overset{\text{(A.1)}}{=} \det \begin{pmatrix} \langle f, f \rangle_{L^2(\mu)} & \big( \langle f, \phi_\ell \rangle_{L^2(\mu)} \big)_{\ell=1}^{N-1} \\ \big( \langle f, \phi_k \rangle_{L^2(\mu)} \big)_{k=1}^{N-1} & \big( \langle \phi_k, \phi_\ell \rangle_{L^2(\mu)} \big)_{k,\ell=1}^{N-1} \end{pmatrix} \overset{\text{(A.2)}}{=} \det \begin{pmatrix} \|f\|^2_{L^2(\mu)} & \big( \langle f, \phi_\ell \rangle_{L^2(\mu)} \big)_{\ell=1}^{N-1} \\ \big( \langle f, \phi_k \rangle_{L^2(\mu)} \big)_{k=1}^{N-1} & I_{N-1} \end{pmatrix} = \|f\|^2_{L^2(\mu)} - \sum_{k=1}^{N-1} \langle f, \phi_k \rangle^2_{L^2(\mu)}. \qquad \text{(A.8)}$$

Finally, the variance in (A.6) equals (A.8) $-$ (A.7)$^2$, i.e., $\|f\|^2_{L^2(\mu)} - \sum_{\ell=0}^{N-1} \langle f, \phi_\ell \rangle^2_{L^2(\mu)}$.

A.3. EZ estimator as a quadrature rule

In this part, we consider Theorem B in the setting where one of the eigenfunctions of the kernel, say $\phi_0$, is constant. In this case, we show that the EZ estimator of $\int f(x)\,\mu(dx)$ can be seen as a quadrature rule in the sense of (1), with weights $\omega_n$ that sum to $\mu([-1,1]^d)$. This is a non-obvious fact, judging from the expression (12) of the estimator.

Proposition 1. Consider $\phi_0$ constant in Theorem B. Then, solving the corresponding linear system (A.4) allows to construct

$$\widehat{I}^{\mathrm{EZ}}_N(f) \overset{(12)}{=} \sqrt{\mu\big([-1,1]^d\big)}\, \frac{\det \Phi^{\phi_0, f}(x_{1:N})}{\det \Phi(x_{1:N})} \qquad \text{(A.9)}$$

as an unbiased estimator of $\int f(x)\,\mu(dx)$, with variance equal to the variance in (A.6) multiplied by $\mu([-1,1]^d)$. In particular, it can be seen as a random quadrature rule (1),

$$\widehat{I}^{\mathrm{EZ}}_N(f) = \sum_{n=1}^{N} \omega_n(x_{1:N}) f(x_n) \approx \int f(x)\,\mu(dx), \qquad \text{(A.10)}$$

such that $\sum_{n=1}^{N} \omega_n(x_{1:N}) = \mu([-1,1]^d)$.

Proof. Take $\phi_0$ constant. Since $\phi_0$ has unit norm, cf. (A.2), it is straightforward to see that

$$\phi_0 = \frac{1}{\sqrt{\mu\big([-1,1]^d\big)}},$$

so that (A.9) can be written

$$\widehat{I}^{\mathrm{EZ}}_N(f) = \mu\big([-1,1]^d\big)\, \frac{\det \Phi^{\phi_0, f}(x_{1:N})}{\det \Phi^{\phi_0, \mathbf{1}}(x_{1:N})},$$

where $\Phi^{\phi_0, \mathbf{1}}$ is obtained by replacing the $\phi_0$ column of $\Phi$ by all ones. Expanding the numerator w.r.t. the first column yields

$$\widehat{I}^{\mathrm{EZ}}_N(f) = \sum_{n=1}^{N} f(x_n)\, \underbrace{(-1)^{1+n}\, \det\big(\phi_k(x_p)\big)_{k=1,\ p=1,\, p \neq n}^{N-1,\ N}\ \frac{\mu\big([-1,1]^d\big)}{\det \Phi^{\phi_0, \mathbf{1}}(x_{1:N})}}_{\triangleq\ \omega_n(x_{1:N})}.$$

In particular, there is a priori no reason for the weights to be nonnegative. Finally,

$$\sum_{n=1}^{N} \omega_n(x_{1:N}) = \frac{\mu\big([-1,1]^d\big)}{\det \Phi^{\phi_0, \mathbf{1}}(x_{1:N})}\ \underbrace{\sum_{n=1}^{N} (-1)^{1+n}\, \det\big(\phi_k(x_p)\big)_{k=1,\ p=1,\, p \neq n}^{N-1,\ N}}_{=\ \det \Phi^{\phi_0, \mathbf{1}}(x_{1:N})} = \mu\big([-1,1]^d\big),$$

where the braced sum is the expansion of $\det \Phi^{\phi_0, \mathbf{1}}(x_{1:N})$ along its column of ones. This concludes the proof.
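For intuition, the weights $\omega_n$ can also be obtained numerically without cofactors, by solving a single transposed linear system. A minimal sketch, reusing the hypothetical feature-matrix builder `Phi` from Section 3 and writing `mu_mass` for $\mu([-1,1]^d)$:

```python
# A minimal sketch of the EZ quadrature weights of Proposition 1: since
# I_EZ = sqrt(mu_mass) * e_1^T Phi^{-1} f(X), the weights solve
# Phi(X)^T w = sqrt(mu_mass) * e_1.
import numpy as np

def ez_weights(X, Phi, mu_mass):
    N = Phi(X).shape[0]
    e1 = np.zeros(N)
    e1[0] = 1.0
    w = np.linalg.solve(Phi(X).T, np.sqrt(mu_mass) * e1)
    return w  # sums to mu_mass; individual weights may be negative
```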

A.4. Sampling multivariate Jacobi ensembles

In this part, we review briefly the main techniques for DPP sampling before we develop our method to generate samples from the multivariate Jacobi ensemble, as defined in Section 2.3. As an illustration, Figure A.1 displays a sample of a two-dimensional Jacobi ensemble with $N = 1000$ points, where the parameters $a^1, b^1, a^2, b^2$ were drawn i.i.d. uniformly on $[-1/2, 1/2]$.

In both finite and continuous cases, except for some specific instances, exact sampling from DPPs (with symmetric kernel) usually requires the spectral decomposition of the kernel before applying the chain rule (4), see, e.g., Hough et al. (2006); Kulesza & Taskar (2012). In the finite case, i.e., $\mathbb{X} = \{1, \dots, M\}$, sampling projection DPPs does not require the eigendecomposition of the kernel, and the chain rule costs $\mathcal{O}(MN^2)$, where $N$ denotes the rank of the kernel. Otherwise, there is a preprocessing cost of order $\mathcal{O}(M^3)$, which may become impractical for large $M$, just like other kernel methods. The same cubic cost applies to Cholesky-based samplers, see, e.g., Launay et al. (2018), or Poulson (2019), who can also treat non-symmetric kernels. Note that this cubic cost can be reduced when the kernel is given in a factored form (Kulesza & Taskar, 2012; Dereziński, 2019).

Unlike the discrete case, sampling from continuous DPPs, even projection ones, remains challenging. The realizations of projection DPPs are usually generated by applying the chain rule (4), where the conditionals are sampled using rejection sampling. The main challenge is to find good proposal distributions to efficiently sample the successive conditionals (Lavancier et al., 2012). In this work, we take $\mathbb{X} = [-1, 1]^d$ and focus on sampling the multivariate Jacobi ensemble, cf. Section 2.3, for a base measure with parameters $a_i, b_i \in [-\frac{1}{2}, \frac{1}{2}]$.

In dimension $d = 1$, to sample the univariate Jacobi ensemble with base measure $\mu(dx) = (1-x)^a (1+x)^b\,dx$, where $a, b > -1$, we use the random tridiagonal matrix model of Killip & Nenciu (2004, Theorem 2). That is to say, computing the eigenvalues of a properly randomized tridiagonal matrix yields a sample of this continuous projection DPP at cost $\mathcal{O}(N^2)$!

For $d \geq 2$, we follow Bardenet & Hardy (2016, Section 3), who proposed to use the chain rule (4) to sample from the multivariate Jacobi ensemble with base measure $\mu(dx) = \omega(x)\,dx$, where

$$\omega(x) = \prod_{i=1}^{d} (1-x^i)^{a_i} (1+x^i)^{b_i}, \quad \text{with } |a_i|, |b_i| \leq \frac{1}{2}. \qquad \text{(A.11)}$$

To that end, we use the same proposal distribution and rejection bound to sample from each of the conditionals in (4). The density (w.r.t. Lebesgue) of the proposal distribution writes

$$\omega_{\mathrm{eq}}(x) = \prod_{i=1}^{d} \frac{1}{\pi \sqrt{1 - (x^i)^2}}\,. \qquad \text{(A.12)}$$


The rejection constant is derived by successive applications of the following result on Jacobi polynomials, derived by Chow et al. (1994).

Proposition 2. (Gautschi, 2009, Equation 1.3) Let $(\phi_k)_{k \geq 0}$ be the orthonormal polynomials w.r.t. the measure $(1-x)^a (1+x)^b\,dx$ with $|a| \leq \frac{1}{2}$, $|b| \leq \frac{1}{2}$. Then, for any $x \in [-1, 1]$ and $k \geq 0$,

$$\pi (1-x)^{a + \frac{1}{2}} (1+x)^{b + \frac{1}{2}}\, \phi_k(x)^2 \leq \frac{2\, \Gamma(k + a + b + 1)\, \Gamma(k + \max(a, b) + 1)}{k!\ \big(k + \frac{a + b + 1}{2}\big)^{2 \max(a, b)}\ \Gamma(k + \min(a, b) + 1)}\,. \qquad \text{(A.13)}$$

The domination of the acceptance ratio, i.e., the ratio of the $n$-th conditional density in (4) over the proposal density (A.12), is computed as follows:

$$\frac{K_N(x, x) - \mathbf{K}_{n-1}(x)^{\mathsf{T}} \mathbf{K}_{n-1}^{-1} \mathbf{K}_{n-1}(x)}{N - (n-1)}\, \omega(x) \times \frac{1}{\omega_{\mathrm{eq}}(x)} \leq \frac{1}{N - (n-1)}\, \frac{K_N(x, x)\, \omega(x)}{\omega_{\mathrm{eq}}(x)} \overset{(6),\,\text{(A.12)}}{=} \frac{1}{N - (n-1)} \sum_{b(k)=0}^{N-1} \prod_{i=1}^{d} \pi (1 - x^i)^{a_i + \frac{1}{2}} (1 + x^i)^{b_i + \frac{1}{2}}\, \phi^i_{k_i}(x^i)^2. \qquad \text{(A.14)}$$

Finally, each of the terms that appear in (A.14) can be bounded using the following recipe; a sketch of the resulting rejection loop follows the list.

1. For $k_i > 0$, we use the bound (A.13).

2. For $k_i = 0$, the bound on the left-hand side (LHS) of (A.13) is not tight enough (it equals 2), so we proceed as follows. In this case, $\phi_0$ is constant, equal to $\big(\int (1-x)^a (1+x)^b\,dx\big)^{-1/2}$, and since $a, b \in [-1/2, 1/2]$ we upper bound $(1-x)^{a + 1/2} (1+x)^{b + 1/2}$ by its evaluation at its mode.

Point 2 is crucial to tighten the rejection constant. Indeed, because of the choice of the ordering $b$ (cf. Section 2.3), the number of times that $\phi^i_0$ appears in (A.14) increases with the dimension. Hence, the tighter the bound on the LHS of (A.13) for $k = 0$, the better the rejection constant.
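A minimal sketch of one acceptance-rejection step for the $n$-th conditional, assuming `conditional_density` stands for the normalized conditional in (4)-(5) and `bound` for the constant obtained from (A.14); both are placeholders, not the authors' exact code:

```python
# A minimal sketch of rejection sampling from one conditional of the
# chain rule (4), with proposal omega_eq (A.12).
import numpy as np

rng = np.random.default_rng(0)

def sample_arcsine(d):
    # One draw from omega_eq(x) dx: cos of a uniform angle on (0, pi)
    # is arcsine-distributed on [-1, 1] in each coordinate.
    return np.cos(np.pi * rng.random(d))

def sample_conditional(conditional_density, omega_eq, bound, d):
    while True:
        x = sample_arcsine(d)
        u = rng.random()
        # accept with probability conditional_density(x) / (bound * omega_eq(x))
        if u * bound * omega_eq(x) < conditional_density(x):
            return x
```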

In Figure A.2 we illustrate the following observations. We note that computing the acceptance ratio requires propagating the three-term recurrence relations up to order $\sqrt[d]{N}$. Thus, for a given $N$, the larger the dimension, the smaller the depth of the recurrence. This could hint that evaluating the kernel (6) becomes cheaper as $d$ increases. However, the rejection rate also increases, so that in practice it is not cheaper to sample in larger dimensions, because the number of rejections dominates. In the particular case of dimension $d = 1$, samples are generated using the fast and rejection-free tridiagonal matrix model of Killip & Nenciu (2004, Theorem 2). This grants huge time savings compared to the acceptance-rejection method. Without it, sampling $N$ points in dimension $d = 1$ would take more time than in larger dimensions, although the associated rejection constants are smaller, as can be seen in Figures A.2(a) and A.2(b).


Figure A.1. A sample of a 2D Jacobi ensemble with $N = 1000$ points and parameters $a^1, b^1, a^2, b^2$ drawn i.i.d. uniformly on $[-1/2, 1/2]$. (a) The large-$N$ limit of the marginals, known to be $\omega_{\mathrm{eq}}$, is plotted on top of the empirical histogram on each marginal plot. (b) Same sample as in (a), but the disk centered at $\mathbf{x}_n$ has an area proportional to the weight $1/K_N(\mathbf{x}_n, \mathbf{x}_n)$, as in $\widehat{I}^{\mathrm{BH}}_N$ (7).


Figure A.2. (a) Average time to get one sample and (b) average number of rejections, as functions of $N$. The colors and numbers correspond to the dimension, and all Jacobi parameters equal $-1/2$. For $d = 1$, the tridiagonal model (tri) of Killip & Nenciu offers tremendous savings; without it, it is cheaper to get a sample in larger dimension. The number of rejections grows as $N^2 d$.


Figure A.3. Rejection bounds $C_{\mathrm{rej}}$ as functions of $N$, for Jacobi parameters all equal to $-1/2$ and for random parameters: (a) $d = 1, 2, 3, 4$; (b) $d = 1, 5, 10, 15$. Given the proposal distribution $\omega_{\mathrm{eq}}(x)\,dx$, it is not surprising that the rejection procedure is more efficient when the base measure is $\mu(dx) = \omega_{\mathrm{eq}}(x)\,dx$, i.e., all parameters equal $-\frac{1}{2}$, than for larger parameters. The larger the parameters, the greater the gap.


B. Experiments

B.1. Reproducing the bump example

In Section 4.1, we reproduce the experiment of Bardenet & Hardy (2016, Section 3), where they illustrate the behavior of $\widehat{I}^{\mathrm{BH}}_N$ on a unimodal, smooth bump function:

$$f(x) = \prod_{i=1}^{d} \exp\!\left( - \frac{1}{1 - \varepsilon - (x^i)^2} \right) \mathbf{1}_{[-1+\varepsilon,\ 1-\varepsilon]}(x^i). \qquad \text{(B.1)}$$

For each value of $N$, we sample 100 times from the same multivariate Jacobi ensemble with i.i.d. uniform parameters on $[-1/2, 1/2]$, compute the resulting 100 values of each estimator, and plot the two resulting sample variances. In addition, in Figure B.2 we test the potential hope for a CLT for $\widehat{I}^{\mathrm{EZ}}_N$ and compare with $\widehat{I}^{\mathrm{BH}}_N$, for which the CLT (8) holds, in the regime $N = 300$.
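A minimal sketch of the bump integrand (B.1), vectorized over an $(n, d)$ array of points:

```python
# A minimal sketch of the bump function (B.1) with eps = 0.05; the inner
# np.where avoids evaluating the fraction outside the support.
import numpy as np

def bump(X, eps=0.05):
    inside = np.abs(X) < 1 - eps
    vals = np.where(inside,
                    np.exp(-1.0 / np.where(inside, 1 - eps - X**2, 1.0)),
                    0.0)
    return vals.prod(axis=1)  # product over the d coordinates
```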

Figure B.1. Reproducing the bump function ($\varepsilon = 0.05$) experiment of Bardenet & Hardy (2016), cf. Section 4.1: variance of both estimators vs. $N$ for $d = 1$ to 4, with slope and $R^2$ in the legends. The expected variance decay of order $1/N^{1+1/d}$ is observed for BH. For $d = 1$, EZ has almost no variance for $N \geq 100$: the bump function is extremely well approximated by a polynomial of degree $N \geq 100$.


Figure B.2. Histograms of 100 independent estimates $\widehat{I}^{\mathrm{BH}}_N$ and $\widehat{I}^{\mathrm{EZ}}_N$ of the integral of the bump function ($\varepsilon = 0.05$) with $N = 300$, and associated $p$-values of the Kolmogorov-Smirnov test, cf. Section 4.1 (e.g., for $d = 2$: BH $p = 0.590$, EZ $p = 0.039$). The fluctuations of BH are confirmed to be Gaussian (cf. CLT (8)). (a) For $d = 1$, the bump function is extremely well approximated by a polynomial of degree 300, hence $\widehat{I}^{\mathrm{EZ}}_N$ has almost no variance. (b)-(d) A few outliers seem to break the potential Gaussianity of $\widehat{I}^{\mathrm{EZ}}_N(f)$. (d) $\widehat{I}^{\mathrm{EZ}}_N(f)$ does not preserve the sign of the integrand.


B.2. Integrating sums of eigenfunctions

In the next series of experiments, we are mainly interested in testing the variance decay of $\widehat{I}^{\mathrm{EZ}}_N(f)$ prescribed by Theorem 1. To that end, we consider functions of the form given by (13), i.e.,

$$f(x) = \sum_{b(k)=0}^{N_{\mathrm{modes}}-1} \frac{1}{b(k) + 1}\, \phi_k(x), \qquad \text{(B.2)}$$

whose integral w.r.t. $\mu$ is to be estimated based on realizations of the multivariate Jacobi ensemble with kernel $K_N(x, y) = \sum_{b(k)=0}^{N-1} \phi_k(x) \phi_k(y)$, where $N \neq N_{\mathrm{modes}}$ a priori. This means that the function $f$ can be either fully ($N_{\mathrm{modes}} \leq N$) or partially ($N_{\mathrm{modes}} > N$) decomposed in the eigenbasis of the kernel. In both cases, we let the number of points $N$ used to build the two estimators vary from 10 to 100, in dimensions $d = 1$ to 4.

In the first setting, we set $N_{\mathrm{modes}} = 70$. Thus, $N$ eventually reaches the number of functions used to build $f$ in (13), after which $\widehat{I}^{\mathrm{EZ}}_N$ is an exact estimator. For each dimension $d$, Figure B.3 indeed shows the drop in the variance of $\widehat{I}^{\mathrm{EZ}}_N$ once the number of points of the DPP hits the threshold $N = N_{\mathrm{modes}}$. This is in perfect agreement with Theorem 1: once $f \in \mathcal{H}_{N_{\mathrm{modes}}} \subseteq \mathcal{H}_N$, the variance (11) is zero.

The second setting has $N_{\mathrm{modes}} = N + 1$, so that the number of points $N$ is never enough for the variance (11) to be zero. As $N$ increases, the contribution of the extra mode $\phi_{b^{-1}(N)}$ in (13) decreases as $\frac{1}{N}$. Hence, from Theorem 1 we expect a variance decay of order $\frac{1}{N^2}$, which we observe in practice, cf. Figure B.4.
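The empirical rates reported in the figure legends are slopes of least-squares fits on log-log scale. A minimal sketch of how such a slope and $R^2$ could be computed (the data arrays are placeholders):

```python
# A minimal sketch of the slope / R^2 reported in the figure legends,
# via least squares on (log N, log variance).
import numpy as np

def loglog_slope(Ns, variances):
    x, y = np.log(Ns), np.log(variances)
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, r**2
```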

Figure B.3. Comparison of $\widehat{I}^{\mathrm{BH}}_N$ and $\widehat{I}^{\mathrm{EZ}}_N$ when integrating a finite sum of 70 eigenfunctions of the DPP kernel, as in (13), cf. Section 4.2; panels (a)-(d) correspond to $d = 1$ to 4.


Figure B.4. Comparison of $\widehat{I}^{\mathrm{BH}}_N$ and $\widehat{I}^{\mathrm{EZ}}_N$ for a linear combination of $N + 1$ eigenfunctions of the DPP kernel, as in (13), cf. Section 4.2; panels (a)-(d) correspond to $d = 1$ to 4, with slope and $R^2$ in the legends (the EZ slopes are close to $-2$ in all dimensions).


B.3. Further experiments

In Appendices B.3.1 to B.3.4, we test the robustness of both the BH and EZ estimators when applied to functions presenting discontinuities or which do not belong to the span of the eigenfunctions of the kernel. Although the conditions of the CLT (8) associated to $\widehat{I}^{\mathrm{BH}}$ are violated, the corresponding variance decay remains smooth, but is not as fast. For $\widehat{I}^{\mathrm{EZ}}$, the performance deteriorates with the dimension. Indeed, the cross terms arising from the Taylor expansion of the different functions introduce monomials, associated with large coefficients, that do not belong to $\mathcal{H}_N$. Sampling more points would reduce the variance (11). But more importantly, for EZ to excel, this suggests adapting the kernel to a basis where the integrand is known to be sparse or to have fast-decaying coefficients.

B.3.1. Integrating absolute value

We consider estimating the integral

$$\int_{[-1,1]^d} \prod_{i=1}^{d} |x^i|\, (1-x^i)^{a_i} (1+x^i)^{b_i}\, dx^i, \qquad \text{(B.3)}$$

where $a_1, b_1 = -\frac{1}{2}$ and the remaining $a_i, b_i$ are i.i.d. uniform on $[-\frac{1}{2}, \frac{1}{2}]$, using the BH (7) and EZ (12) estimators. Results are given in Figure B.5.

Figure B.5. Comparison of $\widehat{I}^{\mathrm{BH}}_N$ and $\widehat{I}^{\mathrm{EZ}}_N$ for the absolute value, cf. Section B.3.1; panels (a)-(d) correspond to $d = 1$ to 4, with slope and $R^2$ in the legends.


B.3.2. Integrating Heaviside

Let $H(x) = 1$ if $x > 0$, and $-1$ otherwise. We consider estimating the integral

$$\int_{[-1,1]^d} \prod_{i=1}^{d} H(x^i)\, (1-x^i)^{a_i} (1+x^i)^{b_i}\, dx^i, \qquad \text{(B.4)}$$

where $a_1, b_1 = -\frac{1}{2}$ and the remaining $a_i, b_i$ are i.i.d. uniform on $[-\frac{1}{2}, \frac{1}{2}]$, using the BH (7) and EZ (12) estimators. Results are given in Figure B.6.

Figure B.6. Comparison of $\widehat{I}^{\mathrm{BH}}_N$ and $\widehat{I}^{\mathrm{EZ}}_N$ for the Heaviside function, cf. Section B.3.2; panels (a)-(d) correspond to $d = 1$ to 4, with slope and $R^2$ in the legends.


B.3.3. Integrating cosine

We consider estimating the integral

$$\int_{[-1,1]^d} \prod_{i=1}^{d} \cos(\pi x^i)\, (1-x^i)^{a_i} (1+x^i)^{b_i}\, dx^i, \qquad \text{(B.5)}$$

where $a_1, b_1 = -\frac{1}{2}$ and the remaining $a_i, b_i$ are i.i.d. uniform on $[-\frac{1}{2}, \frac{1}{2}]$, using the BH (7) and EZ (12) estimators. Results are given in Figure B.7.

Figure B.7. Comparison of $\widehat{I}^{\mathrm{BH}}_N$ and $\widehat{I}^{\mathrm{EZ}}_N$ for the cosine, cf. Section B.3.3; panels (a)-(d) correspond to $d = 1$ to 4, with slope and $R^2$ in the legends.


B.3.4. Integrating a mixture of smooth and non-smooth functions

Let $f(x) = H(x)\big(\cos(\pi x) + \cos(2\pi x) + \sin(5\pi x)\big)$. We consider estimating the integral

$$\int_{[-1,1]^d} \prod_{i=1}^{d} f(x^i)\, (1-x^i)^{a_i} (1+x^i)^{b_i}\, dx^i, \qquad \text{(B.6)}$$

where $a_1, b_1 = -\frac{1}{2}$ and the remaining $a_i, b_i$ are i.i.d. uniform on $[-\frac{1}{2}, \frac{1}{2}]$, using the BH (7) and EZ (12) estimators. Results are given in Figure B.8.

Figure B.8. Comparison of $\widehat{I}^{\mathrm{BH}}_N$ and $\widehat{I}^{\mathrm{EZ}}_N$ for the mixture of smooth and non-smooth functions, cf. Section B.3.4; panels (a)-(d) correspond to $d = 1$ to 4, with slope and $R^2$ in the legends.