
The Annals of Statistics 2007, Vol. 35, No. 6, 2474–2503
DOI: 10.1214/009053607000000488
© Institute of Mathematical Statistics, 2007

SPLINE-BACKFITTED KERNEL SMOOTHING OF NONLINEAR ADDITIVE AUTOREGRESSION MODEL

BY LI WANG AND LIJIAN YANG

University of Georgia and Michigan State University

Application of nonparametric and semiparametric regression techniques to high-dimensional time series data has been hampered due to the lack of effective tools to address the “curse of dimensionality.” Under rather weak conditions, we propose spline-backfitted kernel estimators of the component functions for nonlinear additive time series data that are both computationally expedient, so they are usable for analyzing very high-dimensional time series, and theoretically reliable, so inference can be made on the component functions with confidence. Simulation experiments have provided strong evidence that corroborates the asymptotic theory.

1. Introduction. For the past three decades, various nonparametric and semiparametric regression techniques have been developed for the analysis of nonlinear time series; see, for example, [14, 21, 25], to name one article representative of each decade. Application to high-dimensional time series data, however, has been hampered due to the scarcity of smoothing tools that are not only computationally expedient but also theoretically reliable, which has motivated the proposed procedures of this paper.

In high-dimensional time series smoothing, one unavoidable issue is the “curse of dimensionality,” which refers to the poor convergence rate of nonparametric estimation of general multivariate functions. One solution is regression in the form of an additive model introduced by [9]:

$$Y_i = m(X_{i1}, \ldots, X_{id}) + \sigma(X_{i1}, \ldots, X_{id})\,\varepsilon_i, \qquad m(x_1, \ldots, x_d) = c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha), \tag{1.1}$$

in which the sequence $\{Y_i, \mathbf{X}_i^T\}_{i=1}^n = \{Y_i, X_{i1}, \ldots, X_{id}\}_{i=1}^n$ is a length-$n$ realization of a $(d+1)$-dimensional time series, the $d$-variate functions $m$ and $\sigma$ are the mean and standard deviation of the response $Y_i$ conditional on the predictor vector $\mathbf{X}_i = \{X_{i1}, \ldots, X_{id}\}^T$, and each $\varepsilon_i$ is a white noise conditional on $\mathbf{X}_i$.

Received December 2005; revised January 2007.
Supported in part by NSF Grant DMS-04-05330.
AMS 2000 subject classifications. Primary 62M10; secondary 62G08.
Key words and phrases. Bandwidths, B spline, knots, local linear estimator, mixing, Nadaraya–Watson estimator, nonparametric regression.


In a nonlinear additive autoregression data-analytical context, each predictor $X_{i\alpha}$, $1 \le \alpha \le d$, could be observed lagged values of $Y_i$, such as $X_{i\alpha} = Y_{i-\alpha}$, or of an exogenous time series. Model (1.1), therefore, is exactly the nonlinear additive autoregression model of [14] and [2] with exogenous variables. For identifiability, the additive component functions must satisfy the conditions $Em_\alpha(X_{i\alpha}) \equiv 0$, $\alpha = 1, \ldots, d$.

We propose estimators of the unknown component functions $\{m_\alpha(\cdot)\}_{\alpha=1}^d$ based on a geometrically $\alpha$-mixing sample $\{Y_i, X_{i1}, \ldots, X_{id}\}_{i=1}^n$ following model (1.1). If the data were actually i.i.d. observations instead of a time series realization, many methods would be available for estimating $\{m_\alpha(\cdot)\}_{\alpha=1}^d$. For instance, there are four types of kernel-based estimators: the classic backfitting estimators (CBE) of [9] and [19]; the marginal integration estimators (MIE) of [6, 17, 16, 22, 30] and a rate-optimal kernel-based estimation method of [10]; the smooth backfitting estimators (SBE) of [18]; and the two-stage estimators, such as the one-step backfitting of the integration estimators of [15], the one-step backfitting of the projection estimators of [11] and the one Newton step from the nonlinear LSE estimators of [12]. For spline estimators, see [13, 23, 24] and [28].

In the time series context, however, there are fewer theoretically justified methods due to the additional difficulty posed by dependence in the data. Some of these are the kernel estimators via marginal integration of [25, 29], and the spline estimators of [14]. In addition, [27] has extended the marginal integration kernel estimator to additive coefficient models for weakly dependent data. All of these existing methods are unsatisfactory in regard to either the computational or the theoretical issue. The existing kernel methods are too computationally intensive for high dimension $d$, thus limiting their applicability to a small number of predictors. Spline methods, on the other hand, provide only convergence rates but no asymptotic distributions, so no measures of confidence can be assigned to the estimators.

If the last $d-1$ component functions were known by “oracle,” one could create $\{Y_{i1}, X_{i1}\}_{i=1}^n$ with $Y_{i1} = Y_i - c - \sum_{\alpha=2}^d m_\alpha(X_{i\alpha}) = m_1(X_{i1}) + \sigma(X_{i1}, \ldots, X_{id})\varepsilon_i$, from which one could compute an “oracle smoother” to estimate the only unknown function $m_1(x_1)$, thus effectively bypassing the “curse of dimensionality.” The idea of [15] was to obtain an approximation to the unobservable variables $Y_{i1}$ by substituting $m_\alpha(X_{i\alpha})$, $i = 1, \ldots, n$, $\alpha = 2, \ldots, d$, with marginal integration kernel estimates and arguing that the error incurred by this “cheating” is of smaller magnitude than the rate $O(n^{-2/5})$ for estimating the function $m_1(x_1)$ from the unobservable data. We modify the procedure of [15] by substituting $m_\alpha(X_{i\alpha})$, $i = 1, \ldots, n$, $\alpha = 2, \ldots, d$, with spline estimators. Specifically, we propose a two-stage estimation procedure: first we pre-estimate $\{m_\alpha(x_\alpha)\}_{\alpha=2}^d$ by pilot estimators from an undersmoothed centered standard spline procedure; next we construct the pseudo-responses $\hat Y_{i1}$ and approximate $m_1(x_1)$ by its Nadaraya–Watson estimator in (2.12).

The above proposed spline-backfitted kernel (SPBK) estimation method has several advantages compared to most of the existing methods. First, as pointed out in [22], the estimator of [15] mixed up different projections, making it uninterpretable if the real data generating process deviates from additivity, while the


projections in both steps of our estimator are with respect to the same measure. Second, since our pilot spline estimator is thousands of times faster than the pilot kernel estimator in [15], our proposed method is computationally expedient; see Table 2. Third, the SPBK estimator can be shown to be as efficient as the “oracle smoother” uniformly over any compact range, whereas [15] proved such “oracle efficiency” only at a single point. Moreover, the regularity conditions in our paper are natural and appealing and close to being minimal. In contrast, higher-order smoothness is needed with growing dimensionality of the regressors in [17]. Stronger and more obscure conditions are assumed for the two-stage estimation proposed by [12].

The SPBK estimator achieves its seemingly surprising success by borrowing the strengths of both spline and kernel: the spline does a quick initial estimation of all additive components and removes them all except the one of interest; kernel smoothing is then applied to the cleaned univariate data to estimate with asymptotic distribution. Propositions 4.1 and 5.1 are the keys to understanding the proposed estimators’ uniform oracle efficiency. They accomplish the well-known “reducing bias by undersmoothing” in the first step using the spline and “averaging out the variance” in the second step with the kernel, both steps taking advantage of the joint asymptotics of kernel and spline functions, which is the new feature of our proofs.

Reference [7] provides generalized likelihood ratio (GLR) tests for additive models using the backfitting estimator. A similar GLR test based on our SPBK estimator is a feasible topic for future research.

The rest of the paper is organized as follows. In Section 2 we introduce the SPBK estimator and state its asymptotic “oracle efficiency” under appropriate assumptions. In Section 3 we provide some insights into the ideas behind our proofs of the main results, by decomposing the estimator’s “cheating” error into a bias and a variance part. In Section 4 we show the uniform order of the bias term. In Section 5 we show the uniform order of the variance term. In Section 6 we present Monte Carlo results to demonstrate that the SPBK estimator does indeed possess the claimed asymptotic properties. All technical proofs are contained in the Appendix.

2. The SPBK estimator. In this section we describe the spline-backfitted kernel estimation procedure. For convenience, we denote vectors as $\mathbf{x} = (x_1, \ldots, x_d)$ and take $\|\cdot\|$ as the usual Euclidean norm on $R^d$, that is, $\|\mathbf{x}\| = \sqrt{\sum_{\alpha=1}^d x_\alpha^2}$, and $\|\cdot\|_\infty$ the sup norm, that is, $\|\mathbf{x}\|_\infty = \sup_{1\le\alpha\le d}|x_\alpha|$. In what follows, let $Y_i$ and $\mathbf{X}_i = (X_{i1}, \ldots, X_{id})^T$ be the $i$th response and predictor vector. Denote by $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ the response vector and $(\mathbf{X}_1, \ldots, \mathbf{X}_n)^T$ the design matrix.

Let $\{Y_i, \mathbf{X}_i^T\}_{i=1}^n = \{Y_i, X_{i1}, \ldots, X_{id}\}_{i=1}^n$ be observations from a geometrically $\alpha$-mixing process following model (1.1). We assume that each predictor $X_\alpha$ is


distributed on a compact interval $[a_\alpha, b_\alpha]$, $\alpha = 1, \ldots, d$, and without loss of generality, we take all intervals $[a_\alpha, b_\alpha] = [0,1]$, $\alpha = 1, \ldots, d$. We preselect an integer $N = N_n \sim n^{2/5}\log n$; see assumption (A6) below. Next, we define for any $\alpha = 1, \ldots, d$ the first-order B spline function ([3], page 89), or constant B spline function, as the indicator function $I_{J,\alpha}(x_\alpha)$ of the $N+1$ equally spaced subintervals of the finite interval $[0,1]$ with length $H = H_n = (N+1)^{-1}$, that is,
$$I_{J,\alpha}(x_\alpha) = \begin{cases} 1, & JH \le x_\alpha < (J+1)H, \\ 0, & \text{otherwise}, \end{cases} \qquad J = 0, 1, \ldots, N. \tag{2.1}$$

Define the following centered spline basis:
$$b_{J,\alpha}(x_\alpha) = I_{J+1,\alpha}(x_\alpha) - \frac{\|I_{J+1,\alpha}\|_2}{\|I_{J,\alpha}\|_2}\,I_{J,\alpha}(x_\alpha) \qquad \forall \alpha = 1, \ldots, d,\ J = 1, \ldots, N, \tag{2.2}$$
with the standardized version given for any $\alpha = 1, \ldots, d$,
$$B_{J,\alpha}(x_\alpha) = \frac{b_{J,\alpha}(x_\alpha)}{\|b_{J,\alpha}\|_2} \qquad \forall J = 1, \ldots, N. \tag{2.3}$$
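To make the construction concrete, the following R sketch (hedged: the function and object names are illustrative, not from the authors' code) builds the indicator basis $I_{J,\alpha}$ and the standardized basis $B_{J,\alpha}$ for one covariate, with empirical $L_2$ norms standing in for the theoretical ones.

```r
## Illustrative sketch (not the authors' code): constant B-spline basis for one
## covariate x rescaled to [0,1], with N interior knots and length H = 1/(N+1).
## Empirical L2 norms are used in place of the theoretical norms of (2.2)-(2.3).
const_spline_basis <- function(x, N) {
  H <- 1 / (N + 1)
  # indicator functions I_J(x), J = 0, ..., N (stored in columns J + 1)
  I <- sapply(0:N, function(J) as.numeric(x >= J * H & x < (J + 1) * H))
  I[x == 1, N + 1] <- 1                      # include the right endpoint
  l2 <- function(v) sqrt(mean(v^2))          # empirical L2 norm
  # centered basis b_J = I_{J+1} - (||I_{J+1}|| / ||I_J||) I_J, J = 1, ..., N
  b <- sapply(1:N, function(J) I[, J + 1] - (l2(I[, J + 1]) / l2(I[, J])) * I[, J])
  # standardized basis B_J = b_J / ||b_J||
  sweep(b, 2, apply(b, 2, l2), "/")
}
```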

Define next the $(1+dN)$-dimensional space $G = G[0,1]$ of additive spline functions as the linear space spanned by $\{1, B_{J,\alpha}(x_\alpha),\ \alpha = 1, \ldots, d,\ J = 1, \ldots, N\}$, and denote by $G_n \subset R^n$ the linear space spanned by $\{1, \{B_{J,\alpha}(X_{i\alpha})\}_{i=1}^n,\ \alpha = 1, \ldots, d,\ J = 1, \ldots, N\}$. As $n \to \infty$, the dimension of $G_n$ becomes $1 + dN$ with probability approaching 1. The spline estimator of the additive function $m(\mathbf{x})$ is the unique element $\hat m(\mathbf{x}) = \hat m_n(\mathbf{x})$ of the space $G$ such that the vector $\{\hat m(\mathbf{X}_1), \ldots, \hat m(\mathbf{X}_n)\}^T$ best approximates the response vector $\mathbf{Y}$. To be precise, we define

$$\hat m(\mathbf{x}) = \hat\lambda_0' + \sum_{\alpha=1}^{d}\sum_{J=1}^{N}\hat\lambda_{J,\alpha}' I_{J,\alpha}(x_\alpha), \tag{2.4}$$
where the coefficients $(\hat\lambda_0', \hat\lambda_{1,1}', \ldots, \hat\lambda_{N,d}')$ are solutions of the least squares problem
$$\{\hat\lambda_0', \hat\lambda_{1,1}', \ldots, \hat\lambda_{N,d}'\}^T = \arg\min_{R^{dN+1}} \sum_{i=1}^{n}\Biggl\{Y_i - \lambda_0 - \sum_{\alpha=1}^{d}\sum_{J=1}^{N}\lambda_{J,\alpha} I_{J,\alpha}(X_{i\alpha})\Biggr\}^2.$$

Simple linear algebra shows that
$$\hat m(\mathbf{x}) = \hat\lambda_0 + \sum_{\alpha=1}^{d}\sum_{J=1}^{N}\hat\lambda_{J,\alpha} B_{J,\alpha}(x_\alpha), \tag{2.5}$$
where $(\hat\lambda_0, \hat\lambda_{1,1}, \ldots, \hat\lambda_{N,d})$ are solutions of the least squares problem
$$\{\hat\lambda_0, \hat\lambda_{1,1}, \ldots, \hat\lambda_{N,d}\}^T = \arg\min_{R^{dN+1}} \sum_{i=1}^{n}\Biggl\{Y_i - \lambda_0 - \sum_{\alpha=1}^{d}\sum_{J=1}^{N}\lambda_{J,\alpha} B_{J,\alpha}(X_{i\alpha})\Biggr\}^2; \tag{2.6}$$


while (2.4) is used for data-analytic implementation, the mathematically equivalent expression (2.5) is convenient for asymptotic analysis.

The pilot estimators of each component function and the constant are

$$\hat m_\alpha(x_\alpha) = \sum_{J=1}^{N}\hat\lambda_{J,\alpha} B_{J,\alpha}(x_\alpha) - n^{-1}\sum_{i=1}^{n}\sum_{J=1}^{N}\hat\lambda_{J,\alpha} B_{J,\alpha}(X_{i\alpha}), \qquad \hat m_c = \hat\lambda_0 + n^{-1}\sum_{\alpha=1}^{d}\sum_{i=1}^{n}\sum_{J=1}^{N}\hat\lambda_{J,\alpha} B_{J,\alpha}(X_{i\alpha}). \tag{2.7}$$

These pilot estimators are then used to define new pseudo-responses $\hat Y_{i1}$, which are estimates of the unobservable “oracle” responses $Y_{i1}$. Specifically,
$$\hat Y_{i1} = Y_i - \hat c - \sum_{\alpha=2}^{d}\hat m_\alpha(X_{i\alpha}), \qquad Y_{i1} = Y_i - c - \sum_{\alpha=2}^{d} m_\alpha(X_{i\alpha}), \tag{2.8}$$

where $\hat c = \bar Y_n = n^{-1}\sum_{i=1}^{n} Y_i$, which is a $\sqrt{n}$-consistent estimator of $c$ by the central limit theorem. Next, we define the spline-backfitted kernel estimator of $m_1(x_1)$ as $\hat m_1^*(x_1)$ based on $\{\hat Y_{i1}, X_{i1}\}_{i=1}^n$, which attempts to mimic the would-be Nadaraya–Watson estimator $\tilde m_1^*(x_1)$ of $m_1(x_1)$ based on $\{Y_{i1}, X_{i1}\}_{i=1}^n$ if the unobservable “oracle” responses $\{Y_{i1}\}_{i=1}^n$ were available:
$$\hat m_1^*(x_1) = \frac{\sum_{i=1}^{n} K_h(X_{i1} - x_1)\,\hat Y_{i1}}{\sum_{i=1}^{n} K_h(X_{i1} - x_1)}, \qquad \tilde m_1^*(x_1) = \frac{\sum_{i=1}^{n} K_h(X_{i1} - x_1)\,Y_{i1}}{\sum_{i=1}^{n} K_h(X_{i1} - x_1)}, \tag{2.9}$$
where $\hat Y_{i1}$ and $Y_{i1}$ are defined in (2.8).
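A minimal R sketch of the two-stage procedure (hedged: `const_spline_basis` is the illustrative basis constructor sketched earlier, and all other names and the plug-in choices are assumptions for the sketch, not the authors' implementation):

```r
## Illustrative two-stage SPBK sketch for component 1 of an additive model,
## assuming X is an n x d design matrix rescaled to [0,1] and Y the response.
spbk_m1 <- function(Y, X, x1.grid, N, h) {
  n <- nrow(X); d <- ncol(X)
  # Stage 1: undersmoothed constant-spline pilot fit of all components, (2.5)-(2.7)
  Blist <- lapply(1:d, function(a) const_spline_basis(X[, a], N))
  fit   <- lm(Y ~ do.call(cbind, Blist))
  coefs <- ifelse(is.na(coef(fit)[-1]), 0, coef(fit)[-1])
  m.pilot <- sapply(1:d, function(a) {
    v <- Blist[[a]] %*% coefs[((a - 1) * N + 1):(a * N)]
    v - mean(v)                               # empirically centered, as in (2.7)
  })
  # Stage 2: pseudo-responses (2.8) and Nadaraya-Watson smoothing (2.9)
  Y1.hat <- Y - mean(Y) - rowSums(m.pilot[, -1, drop = FALSE])
  K <- function(u) 0.9375 * (1 - u^2)^2 * (abs(u) <= 1)   # quartic kernel
  sapply(x1.grid, function(x1) {
    w <- K((X[, 1] - x1) / h)
    sum(w * Y1.hat) / sum(w)
  })
}
```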

Throughout this paper, on any fixed interval $[a, b]$, we denote the space of second-order smooth functions as $C^{(2)}[a,b] = \{m \mid m'' \in C[a,b]\}$ and the class of Lipschitz continuous functions for any fixed constant $C > 0$ as $\mathrm{Lip}([a,b], C) = \{m \mid |m(x) - m(x')| \le C|x - x'|,\ \forall x, x' \in [a,b]\}$.

Before presenting the main results, we state the following assumptions.

(A1) The additive component function $m_1(x_1) \in C^{(2)}[0,1]$, while there is a constant $0 < C_\infty < \infty$ such that $m_\beta \in \mathrm{Lip}([0,1], C_\infty)$, $\forall \beta = 2, \ldots, d$.

(A2) There exist positive constants $K_0$ and $\lambda_0$ such that $\alpha(n) \le K_0 e^{-\lambda_0 n}$ holds for all $n$, with the $\alpha$-mixing coefficients for $\{\mathbf{Z}_i = (\mathbf{X}_i^T, \varepsilon_i)\}_{i=1}^n$ defined as
$$\alpha(k) = \sup_{B \in \sigma\{\mathbf{Z}_s, s \le t\},\, C \in \sigma\{\mathbf{Z}_s, s \ge t+k\}} |P(B \cap C) - P(B)P(C)|, \qquad k \ge 1. \tag{2.10}$$

(A3) The noise $\varepsilon_i$ satisfies $E(\varepsilon_i \mid \mathbf{X}_i) = 0$, $E(\varepsilon_i^2 \mid \mathbf{X}_i) = 1$ and $E(|\varepsilon_i|^{2+\delta} \mid \mathbf{X}_i) < M_\delta$ for some $\delta > 1/2$ and a finite positive $M_\delta$, and $\sigma(\mathbf{x})$ is continuous on $[0,1]^d$:
$$0 < c_\sigma \le \inf_{\mathbf{x}\in[0,1]^d}\sigma(\mathbf{x}) \le \sup_{\mathbf{x}\in[0,1]^d}\sigma(\mathbf{x}) \le C_\sigma < \infty.$$


(A4) The density function $f(\mathbf{x})$ of $\mathbf{X}$ is continuous and
$$0 < c_f \le \inf_{\mathbf{x}\in[0,1]^d}f(\mathbf{x}) \le \sup_{\mathbf{x}\in[0,1]^d}f(\mathbf{x}) \le C_f < \infty.$$
The marginal densities $f_\alpha(x_\alpha)$ of $X_\alpha$ have continuous derivatives on $[0,1]$ as well as the uniform upper bound $C_f$ and lower bound $c_f$.

(A5) The kernel function $K \in \mathrm{Lip}([-1,1], C_K)$ for some constant $C_K > 0$, and is bounded, nonnegative, symmetric and supported on $[-1,1]$. The bandwidth $h \sim n^{-1/5}$, that is, $c_h n^{-1/5} \le h \le C_h n^{-1/5}$ for some positive constants $C_h$, $c_h$.

(A6) The number of interior knots $N \sim n^{2/5}\log n$, that is, $c_N n^{2/5}\log n \le N \le C_N n^{2/5}\log n$ for some positive constants $c_N$, $C_N$.

REMARK 2.1. The smoothness assumption on the true component functions is greatly relaxed in our paper and we believe that our assumption (A1) is close to being minimal. By the result of [20], a geometrically ergodic time series is a strongly mixing sequence. Therefore, assumption (A2) is suitable for (1.1) as a time series model under the aforementioned assumptions. Assumptions (A3)–(A5) are typical in the nonparametric smoothing literature; see, for instance, [5]. For (A6), the proof of Theorem 2.1 in the Appendix will make it clear that the number of knots can be of the more general form $N \sim n^{2/5}N'$, where the sequence $N'$ satisfies $N' \to \infty$, $n^{-\theta}N' \to 0$ for any $\theta > 0$. There is no optimal way to choose $N'$ in the literature. Here we select $N$ to be of barely larger order than $n^{2/5}$.

The asymptotic property of the kernel smoother $\tilde m_1^*(x_1)$ is well developed. Under assumptions (A1)–(A5), it is straightforward to verify (as in [1]) that
$$\sup_{x_1\in[h,1-h]}|\tilde m_1^*(x_1) - m_1(x_1)| = o_p(n^{-2/5}\log n), \qquad \sqrt{nh}\{\tilde m_1^*(x_1) - m_1(x_1) - b_1(x_1)h^2\} \xrightarrow{D} N\{0, v_1^2(x_1)\},$$
where
$$b_1(x_1) = \int u^2K(u)\,du\,\{m_1''(x_1)f_1(x_1)/2 + m_1'(x_1)f_1'(x_1)\}f_1^{-1}(x_1), \qquad v_1^2(x_1) = \int K^2(u)\,du\,E[\sigma^2(X_1, \ldots, X_d)\mid X_1 = x_1]f_1^{-1}(x_1). \tag{2.11}$$

The following theorem states that the asymptotic uniform magnitude of the difference between $\hat m_1^*(x_1)$ and $\tilde m_1^*(x_1)$ is of order $o_p(n^{-2/5})$, which is dominated by the asymptotic uniform size of $\tilde m_1^*(x_1) - m_1(x_1)$. As a result, $\hat m_1^*(x_1)$ will have the same asymptotic distribution as $\tilde m_1^*(x_1)$.

THEOREM 2.1. Under assumptions (A1)–(A6), the SPBK estimator $\hat m_1^*(x_1)$ given in (2.9) satisfies
$$\sup_{x_1\in[0,1]}|\hat m_1^*(x_1) - \tilde m_1^*(x_1)| = o_p(n^{-2/5}).$$


Hence with $b_1(x_1)$ and $v_1^2(x_1)$ as defined in (2.11), for any $x_1 \in [h, 1-h]$,
$$\sqrt{nh}\{\hat m_1^*(x_1) - m_1(x_1) - b_1(x_1)h^2\} \xrightarrow{D} N\{0, v_1^2(x_1)\}.$$

REMARK 2.2. Theorem 2.1 holds for $\hat m_\alpha^*(x_\alpha)$ similarly constructed as $\hat m_1^*(x_1)$, for any $\alpha = 2, \ldots, d$, that is,
$$\hat m_\alpha^*(x_\alpha) = \frac{\sum_{i=1}^{n} K_h(X_{i\alpha} - x_\alpha)\,\hat Y_{i\alpha}}{\sum_{i=1}^{n} K_h(X_{i\alpha} - x_\alpha)}, \qquad \hat Y_{i\alpha} = Y_i - \hat c - \sum_{1\le\beta\le d,\,\beta\ne\alpha}\hat m_\beta(X_{i\beta}), \tag{2.12}$$
where $\hat m_\beta(X_{i\beta})$, $\beta = 1, \ldots, d$, are the pilot estimators of each component function given in (2.7). Similar constructions can be based on a local polynomial instead of the Nadaraya–Watson estimator. For more on the properties of local polynomial estimators, in particular, their minimax efficiency, see [5].

REMARK 2.3. Compared to the SBE in [18], the variance term $v_1(x_1)$ is identical to that of the SBE, while the bias term $b_1(x_1)$ is much more explicit than that of the SBE, at least when the Nadaraya–Watson smoother is used. Theorem 2.1 can be used to construct asymptotic confidence intervals. Under assumptions (A1)–(A6), for any $\alpha \in (0,1)$, an asymptotic $100(1-\alpha)\%$ pointwise confidence interval for $m_1(x_1)$ is
$$\hat m_1^*(x_1) - b_1(x_1)h^2 \pm z_{\alpha/2}\,\hat\sigma_1(x_1)\biggl\{\int K^2(u)\,du\biggr\}^{1/2}\Big/\{nh\hat f_1(x_1)\}^{1/2}, \tag{2.13}$$
where $\hat\sigma_1^2(x_1)$ and $\hat f_1(x_1)$ are estimators of $E[\sigma^2(X_1, \ldots, X_d)\mid X_1 = x_1]$ and $f_1(x_1)$, respectively.
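As a hedged illustration in R (the estimator inputs `sigma1.hat` and `f1.hat`, and the omission of the bias term, are assumptions of this sketch, not prescribed by the paper), the interval (2.13) can be evaluated pointwise once plug-in estimates are available; for the quartic kernel used in Section 6, $\int K^2(u)\,du = 5/7$.

```r
## Sketch: pointwise confidence interval (2.13) on a grid, given the SPBK fit
## m1.star, a conditional-variance estimate sigma1.hat and a kernel density
## estimate f1.hat evaluated on the same grid.  The bias term b1*h^2 is dropped,
## as is common when the pilot stage undersmooths.
spbk_ci <- function(m1.star, sigma1.hat, f1.hat, n, h, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  half <- z * sigma1.hat * sqrt(5 / 7) / sqrt(n * h * f1.hat)  # 5/7 = int K^2 for quartic K
  cbind(lower = m1.star - half, upper = m1.star + half)
}
```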

The following corollary provides the asymptotic distribution of $\hat m^*(\mathbf{x})$. The proof of this corollary is straightforward and therefore omitted.

COROLLARY 2.1. Under assumptions (A1)–(A6) and the additional assumption that $m_\alpha(x_\alpha) \in C^{(2)}[0,1]$, $\alpha = 2, \ldots, d$, for any $\mathbf{x} \in [0,1]^d$, let the SPBK estimators $\hat m_\alpha^*(x_\alpha)$, $\alpha = 1, \ldots, d$, be defined as in (2.12), and let
$$\hat m^*(\mathbf{x}) = \hat c + \sum_{\alpha=1}^{d}\hat m_\alpha^*(x_\alpha), \qquad b(\mathbf{x}) = \sum_{\alpha=1}^{d}b_\alpha(x_\alpha), \qquad v^2(\mathbf{x}) = \sum_{\alpha=1}^{d}v_\alpha^2(x_\alpha).$$
Then
$$\sqrt{nh}\{\hat m^*(\mathbf{x}) - m(\mathbf{x}) - b(\mathbf{x})h^2\} \xrightarrow{D} N\{0, v^2(\mathbf{x})\}.$$


3. Decomposition. In this section we introduce some additional notation to shed some light on the ideas behind the proof of Theorem 2.1. For any functions $\phi, \varphi$ on $[0,1]^d$, define the empirical inner product and the empirical norm as
$$\langle\phi,\varphi\rangle_{2,n} = n^{-1}\sum_{i=1}^{n}\phi(\mathbf{X}_i)\varphi(\mathbf{X}_i), \qquad \|\phi\|_{2,n}^2 = n^{-1}\sum_{i=1}^{n}\phi^2(\mathbf{X}_i).$$
In addition, if the functions $\phi, \varphi$ are $L_2$-integrable, define the theoretical inner product and its corresponding theoretical $L_2$ norm as
$$\langle\phi,\varphi\rangle_2 = E\{\phi(\mathbf{X}_i)\varphi(\mathbf{X}_i)\}, \qquad \|\phi\|_2^2 = E\{\phi^2(\mathbf{X}_i)\}.$$

The evaluation of the spline estimator $\hat m(\mathbf{x})$ at the $n$ observations results in an $n$-dimensional vector, $\hat m(\mathbf{X}_1, \ldots, \mathbf{X}_n) = \{\hat m(\mathbf{X}_1), \ldots, \hat m(\mathbf{X}_n)\}^T$, which can be considered as the projection of $\mathbf{Y}$ on the space $G_n$ with respect to the empirical inner product $\langle\cdot,\cdot\rangle_{2,n}$. In general, for any $n$-dimensional vector $\boldsymbol{\Phi} = \{\Phi_1, \ldots, \Phi_n\}^T$, we define $P_n\boldsymbol{\Phi}(\mathbf{x})$ as the spline function constructed from the projection of $\boldsymbol{\Phi}$ on the inner product space $(G_n, \langle\cdot,\cdot\rangle_{2,n})$, that is,
$$P_n\boldsymbol{\Phi}(\mathbf{x}) = \hat\lambda_0 + \sum_{\alpha=1}^{d}\sum_{J=1}^{N}\hat\lambda_{J,\alpha}B_{J,\alpha}(x_\alpha),$$
with the coefficients $(\hat\lambda_0, \hat\lambda_{1,1}, \ldots, \hat\lambda_{N,d})$ given in (2.6). Next, the multivariate function $P_n\boldsymbol{\Phi}(\mathbf{x})$ is decomposed into the empirically centered additive components $P_{n,\alpha}\boldsymbol{\Phi}(x_\alpha)$, $\alpha = 1, \ldots, d$, and the constant component $P_{n,c}\boldsymbol{\Phi}$:
$$P_{n,\alpha}\boldsymbol{\Phi}(x_\alpha) = P_{n,\alpha}^*\boldsymbol{\Phi}(x_\alpha) - n^{-1}\sum_{i=1}^{n}P_{n,\alpha}^*\boldsymbol{\Phi}(X_{i\alpha}), \tag{3.1}$$
$$P_{n,c}\boldsymbol{\Phi} = \hat\lambda_0 + n^{-1}\sum_{\alpha=1}^{d}\sum_{i=1}^{n}P_{n,\alpha}^*\boldsymbol{\Phi}(X_{i\alpha}), \tag{3.2}$$
where $P_{n,\alpha}^*\boldsymbol{\Phi}(x_\alpha) = \sum_{J=1}^{N}\hat\lambda_{J,\alpha}B_{J,\alpha}(x_\alpha)$. With this new notation, we can rewrite the spline estimators $\hat m(\mathbf{x})$, $\hat m_\alpha(x_\alpha)$, $\hat m_c$ defined in (2.5) and (2.7) as
$$\hat m(\mathbf{x}) = P_n\mathbf{Y}(\mathbf{x}), \qquad \hat m_\alpha(x_\alpha) = P_{n,\alpha}\mathbf{Y}(x_\alpha), \qquad \hat m_c = P_{n,c}\mathbf{Y}.$$

Based on the relation $Y_i = m(\mathbf{X}_i) + \sigma(\mathbf{X}_i)\varepsilon_i$, one defines similarly the noiseless spline smoothers and the variance spline components,
$$\tilde m(\mathbf{x}) = P_n\{m(\mathbf{X})\}(\mathbf{x}), \qquad \tilde m_\alpha(x_\alpha) = P_{n,\alpha}\{m(\mathbf{X})\}(x_\alpha), \qquad \tilde m_c = P_{n,c}\{m(\mathbf{X})\}, \tag{3.3}$$
$$\tilde\varepsilon(\mathbf{x}) = P_n\mathbf{E}(\mathbf{x}), \qquad \tilde\varepsilon_\alpha(x_\alpha) = P_{n,\alpha}\mathbf{E}(x_\alpha), \qquad \tilde\varepsilon_c = P_{n,c}\mathbf{E}, \tag{3.4}$$


where the noise vector $\mathbf{E} = \{\sigma(\mathbf{X}_i)\varepsilon_i\}_{i=1}^n$. Due to the linearity of the operators $P_n$, $P_{n,c}$, $P_{n,\alpha}$, $\alpha = 1, \ldots, d$, one has the crucial decomposition
$$\hat m(\mathbf{x}) = \tilde m(\mathbf{x}) + \tilde\varepsilon(\mathbf{x}), \qquad \hat m_c = \tilde m_c + \tilde\varepsilon_c, \qquad \hat m_\alpha(x_\alpha) = \tilde m_\alpha(x_\alpha) + \tilde\varepsilon_\alpha(x_\alpha), \tag{3.5}$$
for $\alpha = 1, \ldots, d$. As closer examination is needed later for $\tilde\varepsilon(\mathbf{x})$ and $\tilde\varepsilon_\alpha(x_\alpha)$, we define in addition $\hat{\mathbf{a}} = \{\hat a_0, \hat a_{1,1}, \ldots, \hat a_{N,d}\}^T$ as the minimizer of
$$\sum_{i=1}^{n}\Biggl\{\sigma(\mathbf{X}_i)\varepsilon_i - a_0 - \sum_{\alpha=1}^{d}\sum_{J=1}^{N}a_{J,\alpha}B_{J,\alpha}(X_{i\alpha})\Biggr\}^2. \tag{3.6}$$
Then $\tilde\varepsilon(\mathbf{x}) = \hat{\mathbf{a}}^T\mathbf{B}(\mathbf{x})$, where the vector $\mathbf{B}(\mathbf{x})$ and matrix $\mathbf{B}$ are defined as
$$\mathbf{B}(\mathbf{x}) = \{1, B_{1,1}(x_1), \ldots, B_{N,d}(x_d)\}^T, \qquad \mathbf{B} = \{\mathbf{B}(\mathbf{X}_1), \ldots, \mathbf{B}(\mathbf{X}_n)\}^T. \tag{3.7}$$

Thus $\hat{\mathbf{a}} = (\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{E}$ is the solution of (3.6), and specifically $\hat{\mathbf{a}}$ is equal to
$$\begin{Bmatrix} 1 & \mathbf{0}_{dN}^T \\ \mathbf{0}_{dN} & \langle B_{J,\alpha}, B_{J',\alpha'}\rangle_{2,n} \end{Bmatrix}^{-1}_{\substack{1\le\alpha,\alpha'\le d\\ 1\le J,J'\le N}} \times \begin{Bmatrix} \displaystyle\frac{1}{n}\sum_{i=1}^{n}\sigma(\mathbf{X}_i)\varepsilon_i \\[2ex] \displaystyle\frac{1}{n}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)\varepsilon_i \end{Bmatrix}_{\substack{1\le J\le N\\ 1\le\alpha\le d}}, \tag{3.8}$$

where $\mathbf{0}_p$ is a $p$-vector with all elements 0. Our main objective is to study the difference between the smoothed backfitted estimator $\hat m_1^*(x_1)$ and the smoothed “oracle” estimator $\tilde m_1^*(x_1)$, both given in (2.9). From now on, we assume without loss of generality that $d = 2$ for notational brevity. Making use of the definition of $\hat c$ and the signal and noise decomposition (3.5), the difference $\hat m_1^*(x_1) - \tilde m_1^*(x_1) - \hat c + c$ can be treated as the sum of two terms,
$$\frac{n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\{\hat m_2(X_{i2}) - m_2(X_{i2})\}}{n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)} = \frac{\Psi_b(x_1) + \Psi_v(x_1)}{n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)}, \tag{3.9}$$

where
$$\Psi_b(x_1) = \frac{1}{n}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\{\tilde m_2(X_{i2}) - m_2(X_{i2})\}, \tag{3.10}$$
$$\Psi_v(x_1) = \frac{1}{n}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\,\tilde\varepsilon_2(X_{i2}). \tag{3.11}$$


The term $\Psi_b(x_1)$ is induced by the bias term $\tilde m_2(X_{i2}) - m_2(X_{i2})$, while $\Psi_v(x_1)$ is related to the variance term $\tilde\varepsilon_2(X_{i2})$. Both of these terms are of order $o_p(n^{-2/5})$ by Propositions 4.1 and 5.1 in the next two sections. Standard theory of kernel density estimation ensures that the denominator in (3.9), $n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)$, has a positive lower bound for $x_1 \in [0,1]$. The additional nuisance term $\hat c - c$ is clearly of order $O_p(n^{-1/2})$ and thus $o_p(n^{-2/5})$, so it needs no further argument in the proofs. Theorem 2.1 then follows from Propositions 4.1 and 5.1.

4. Bias reduction for $\Psi_b(x_1)$. In this section we show that the bias term $\Psi_b(x_1)$ of (3.10) is uniformly of order $o_p(n^{-2/5})$ for $x_1 \in [0,1]$.

PROPOSITION 4.1. Under assumptions (A1), (A2) and (A4)–(A6),
$$\sup_{x_1\in[0,1]}|\Psi_b(x_1)| = O_p(n^{-1/2} + H) = o_p(n^{-2/5}).$$

LEMMA 4.1. Under assumption (A1), there exist functions $g_1, g_2 \in G$ such that
$$\Biggl\|\tilde m - g + \sum_{\alpha=1}^{2}\langle 1, g_\alpha(X_\alpha)\rangle_{2,n}\Biggr\|_{2,n} = O_p(n^{-1/2} + H),$$
where $g(\mathbf{x}) = c + \sum_{\alpha=1}^{2}g_\alpha(x_\alpha)$ and $\tilde m$ is defined in (3.3).

PROOF. According to the result on page 149 of [3], there is a constant $C_\infty > 0$ such that for the functions $g_\alpha \in G$, $\|g_\alpha - m_\alpha\|_\infty \le C_\infty H$, $\alpha = 1, 2$. Thus $\|g - m\|_\infty \le \sum_{\alpha=1}^{2}\|g_\alpha - m_\alpha\|_\infty \le 2C_\infty H$ and $\|\tilde m - m\|_{2,n} \le \|g - m\|_{2,n} \le 2C_\infty H$. Noting that $\|\tilde m - g\|_{2,n} \le \|\tilde m - m\|_{2,n} + \|g - m\|_{2,n} \le 4C_\infty H$, one has
$$|\langle g_\alpha(X_\alpha), 1\rangle_{2,n}| \le |\langle 1, g_\alpha(X_\alpha)\rangle_{2,n} - \langle 1, m_\alpha(X_\alpha)\rangle_{2,n}| + |\langle 1, m_\alpha(X_\alpha)\rangle_{2,n}| \le C_\infty H + O_p(n^{-1/2}). \tag{4.1}$$

Therefore
$$\Biggl\|\tilde m - g + \sum_{\alpha=1}^{2}\langle 1, g_\alpha(X_\alpha)\rangle_{2,n}\Biggr\|_{2,n} \le \|\tilde m - g\|_{2,n} + \sum_{\alpha=1}^{2}|\langle 1, g_\alpha(X_\alpha)\rangle_{2,n}| \le 6C_\infty H + O_p(n^{-1/2}) = O_p(n^{-1/2} + H).$$

PROOF OF PROPOSITION 4.1. Denote
$$R_1 = \sup_{x_1\in[0,1]}\biggl|\frac{\sum_{i=1}^{n}K_h(X_{i1}-x_1)\{g_2(X_{i2}) - m_2(X_{i2})\}}{\sum_{i=1}^{n}K_h(X_{i1}-x_1)}\biggr|, \qquad R_2 = \sup_{x_1\in[0,1]}\biggl|\frac{\sum_{i=1}^{n}K_h(X_{i1}-x_1)\{\tilde m_2(X_{i2}) - g_2(X_{i2}) + \langle 1, g_2(X_2)\rangle_{2,n}\}}{\sum_{i=1}^{n}K_h(X_{i1}-x_1)}\biggr|;$$


then $\sup_{x_1\in[0,1]}|\Psi_b(x_1)| \le |\langle 1, g_2(X_2)\rangle_{2,n}| + R_1 + R_2$. For $R_1$, using the result on page 149 of [3], one has $R_1 \le C_\infty H$. To deal with $R_2$, let $B_{J,\alpha}^*(x_\alpha) = B_{J,\alpha}(x_\alpha) - \langle 1, B_{J,\alpha}(X_\alpha)\rangle_{2,n}$, for $J = 1, \ldots, N$, $\alpha = 1, 2$; then one can write
$$\tilde m(\mathbf{x}) - g(\mathbf{x}) + \sum_{\alpha=1}^{2}\langle 1, g_\alpha(X_\alpha)\rangle_{2,n} = a^* + \sum_{\alpha=1}^{2}\sum_{J=1}^{N}a_{J,\alpha}^*B_{J,\alpha}^*(x_\alpha).$$

Thus, $n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\{\tilde m_2(X_{i2}) - g_2(X_{i2}) + \langle 1, g_2(X_2)\rangle_{2,n}\}$ can be rewritten as $n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\sum_{J=1}^{N}a_{J,2}^*B_{J,2}^*(X_{i2})$, bounded by
$$\sum_{J=1}^{N}|a_{J,2}^*|\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)B_{J,2}^*(X_{i2})\Biggr| \le \sum_{J=1}^{N}|a_{J,2}^*|\Biggl\{\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{i=1}^{n}\omega_J(\mathbf{X}_i, x_1)\Biggr| + A_{n,1}\Biggl|n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\Biggr|\Biggr\},$$
where $A_{n,1} = \sup_{J,\alpha}|\langle 1, B_{J,\alpha}\rangle_{2,n} - \langle 1, B_{J,\alpha}\rangle_2| = O_p(n^{-1/2}\log n)$ as in (A.12) and $\omega_J(\mathbf{X}_i, x_1)$ is given in (5.5) with mean $\mu_{\omega_J}(x_1)$. By Lemma A.3,

$$\sup_{x_1\in[0,1]}\sup_{1\le J\le N}\Biggl|\frac{1}{n}\sum_{i=1}^{n}\omega_J(\mathbf{X}_i, x_1)\Biggr| \le \sup_{x_1\in[0,1]}\sup_{1\le J\le N}\Biggl|\frac{1}{n}\sum_{i=1}^{n}\omega_J(\mathbf{X}_i, x_1) - \mu_{\omega_J}(x_1)\Biggr| + \sup_{x_1\in[0,1]}\sup_{1\le J\le N}|\mu_{\omega_J}(x_1)| = O_p\bigl(\log n/\sqrt{nh}\bigr) + O_p(H^{1/2}) = O_p(H^{1/2}).$$

Therefore, one has
$$\sup_{x_1\in[0,1]}\Biggl|n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\{\tilde m_2(X_{i2}) - g_2(X_{i2}) + \langle 1, g_2(X_2)\rangle_{2,n}\}\Biggr| \le \Biggl\{N\sum_{J=1}^{N}(a_{J,2}^*)^2\Biggr\}^{1/2}\Biggl\{O_p(H^{1/2}) + O_p\biggl(\frac{\log n}{\sqrt{n}}\biggr)\Biggr\} = O_p\Biggl(\Biggl\{\sum_{J=1}^{N}(a_{J,2}^*)^2\Biggr\}^{1/2}\Biggr) = O_p\Biggl(\Biggl\|\tilde m - g + \sum_{\alpha=1}^{2}\langle 1, g_\alpha(X_\alpha)\rangle_{2,n}\Biggr\|_2\Biggr) = O_p\Biggl(\Biggl\|\tilde m - g + \sum_{\alpha=1}^{2}\langle 1, g_\alpha(X_\alpha)\rangle_{2,n}\Biggr\|_{2,n}\Biggr),$$


where the last step follows from Lemma A.8. Thus, by Lemma 4.1,
$$R_2 = O_p(n^{-1/2} + H). \tag{4.2}$$
Combining (4.1) and (4.2), one establishes Proposition 4.1. $\square$

5. Variance reduction for $\Psi_v(x_1)$. In this section we will see that the term $\Psi_v(x_1)$ given in (3.11) is uniformly of order $o_p(n^{-2/5})$. This is the most challenging part of the proof, mostly done in the Appendix. Define the auxiliary quantity
$$\tilde\varepsilon_2^*(x_2) = \sum_{J=1}^{N}\hat a_{J,2}B_{J,2}(x_2), \tag{5.1}$$
where $\hat a_{J,2}$ is given in (3.8). Definitions (3.1) and (3.2) imply that $\tilde\varepsilon_2(x_2)$ defined in (3.4) is simply the empirical centering of $\tilde\varepsilon_2^*(x_2)$, that is,
$$\tilde\varepsilon_2(x_2) \equiv \tilde\varepsilon_2^*(x_2) - n^{-1}\sum_{i=1}^{n}\tilde\varepsilon_2^*(X_{i2}). \tag{5.2}$$

PROPOSITION 5.1. Under assumptions (A2)–(A6), one has
$$\sup_{x_1\in[0,1]}|\Psi_v(x_1)| = O_p(H) = o_p(n^{-2/5}).$$

According to (5.2), we can write $\Psi_v(x_1) = \Psi_v^{(2)}(x_1) - \Psi_v^{(1)}(x_1)$, where
$$\Psi_v^{(1)}(x_1) = n^{-1}\sum_{l=1}^{n}K_h(X_{l1}-x_1)\cdot n^{-1}\sum_{i=1}^{n}\tilde\varepsilon_2^*(X_{i2}), \tag{5.3}$$
$$\Psi_v^{(2)}(x_1) = n^{-1}\sum_{l=1}^{n}K_h(X_{l1}-x_1)\,\tilde\varepsilon_2^*(X_{l2}), \tag{5.4}$$

in which $\tilde\varepsilon_2^*(X_{i2})$ is given in (5.1). Further one denotes
$$\omega_J(\mathbf{X}_l, x_1) = K_h(X_{l1}-x_1)B_{J,2}(X_{l2}), \qquad \mu_{\omega_J}(x_1) = E\,\omega_J(\mathbf{X}_l, x_1). \tag{5.5}$$
By (3.8) and (5.1), $\Psi_v^{(2)}(x_1)$ can be rewritten as
$$\Psi_v^{(2)}(x_1) = n^{-1}\sum_{l=1}^{n}\sum_{J=1}^{N}\hat a_{J,2}\,\omega_J(\mathbf{X}_l, x_1). \tag{5.6}$$

The uniform orders of $\Psi_v^{(1)}(x_1)$ and $\Psi_v^{(2)}(x_1)$ are given in the next two lemmas.

LEMMA 5.1. Under assumptions (A2)–(A6), $\Psi_v^{(1)}(x_1)$ in (5.3) satisfies
$$\sup_{x_1\in[0,1]}\bigl|\Psi_v^{(1)}(x_1)\bigr| = O_p\{N(\log n)^2/n\}.$$


PROOF. Based on (5.1),
$$n^{-1}\sum_{i=1}^{n}\tilde\varepsilon_2^*(X_{i2}) \le \Biggl|\sum_{J=1}^{N}\hat a_{J,2}\Biggr|\cdot\sup_{1\le J\le N}\Biggl|\frac{1}{n}\sum_{i=1}^{n}B_{J,2}(X_{i2})\Biggr|.$$
Lemma A.6 implies that
$$\Biggl|\sum_{J=1}^{N}\hat a_{J,2}\Biggr| \le \Biggl\{N\sum_{J=1}^{N}\hat a_{J,2}^2\Biggr\}^{1/2} \le \{N\,\hat{\mathbf{a}}^T\hat{\mathbf{a}}\}^{1/2} = O_p(Nn^{-1/2}\log n).$$
By (A.12), $\sup_{1\le J\le N}|n^{-1}\sum_{i=1}^{n}B_{J,2}(X_{i2})| \le A_{n,1} = O_p(n^{-1/2}\log n)$, so
$$\frac{1}{n}\sum_{i=1}^{n}\tilde\varepsilon_2^*(X_{i2}) = O_p\{N(\log n)^2/n\}. \tag{5.7}$$
By assumption (A5) on the kernel function $K$, standard theory of kernel density estimation entails that $\sup_{x_1\in[0,1]}|n^{-1}\sum_{l=1}^{n}K_h(X_{l1}-x_1)| = O_p(1)$. Thus with (5.7) the lemma follows immediately. $\square$

LEMMA 5.2. Under assumptions (A2)–(A6), $\Psi_v^{(2)}(x_1)$ in (5.4) satisfies
$$\sup_{x_1\in[0,1]}\bigl|\Psi_v^{(2)}(x_1)\bigr| = O_p(H).$$
Lemma 5.2 follows from Lemmas A.10 and A.11. Proposition 5.1 follows from Lemmas 5.1 and 5.2.

6. Simulation example. In this section we carry out two simulation experiments to illustrate the finite-sample behavior of our SPBK estimators. The programming codes are available in both R 2.2.1 and XploRe. For information on XploRe, see [8] or visit www.xplore-stat.de.

The number of interior knots $N$ for the spline estimation as in (2.6) is determined by the sample size $n$ and a tuning constant $c$. To be precise,
$$N = \min\bigl([cn^{2/5}\log n] + 1,\ [(n/2-1)d^{-1}]\bigr),$$
in which $[a]$ denotes the integer part of $a$. In our simulation study, we have used $c = 0.5, 1.0$. As seen in Table 1, the choice of $c$ makes little difference, so we always recommend using $c = 0.5$ to save computation for massive data sets. The additional constraint that $N \le (n/2-1)d^{-1}$ ensures that the number of terms in the linear least squares problem (2.6), $1 + dN$, is no greater than $n/2$, which is necessary when the sample size $n$ is moderate and the dimension $d$ is high.
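For concreteness, a small R sketch of this knot-number rule (the function name is illustrative):

```r
## Knot-number rule used in the simulations:
## N = min([c n^{2/5} log n] + 1, [(n/2 - 1)/d]), with [.] the integer part.
num_knots <- function(n, d, c = 0.5) {
  min(floor(c * n^(2/5) * log(n)) + 1, floor((n / 2 - 1) / d))
}
num_knots(500, d = 3)   # gives 38 for c = 0.5
```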

We have obtained for comparison both the SPBK estimator $\hat m_\alpha^*(x_\alpha)$ and the “oracle” estimator $\tilde m_\alpha^*(x_\alpha)$ by Nadaraya–Watson regression estimation using a quartic kernel and the rule-of-thumb bandwidth.


TABLE 1
Report of Example 6.1

                      Component #1          Component #2          Component #3
σ0     n     c     1st stage  2nd stage  1st stage  2nd stage  1st stage  2nd stage
0.5   100   0.5     0.1231     0.0461     0.1476     0.0645     0.1254     0.0681
            1.0     0.1278     0.0520     0.1404     0.0690     0.1318     0.0726
      200   0.5     0.0539     0.0125     0.0616     0.0275     0.0577     0.0252
            1.0     0.0841     0.0144     0.0839     0.0290     0.0848     0.0285
      500   0.5     0.0263     0.0031     0.0306     0.0107     0.0278     0.0102
            1.0     0.0595     0.0044     0.0578     0.0115     0.0605     0.0119
     1000   0.5     0.0169     0.0015     0.0210     0.0053     0.0178     0.0054
            1.0     0.0364     0.0018     0.0367     0.0054     0.0375     0.0059
1.0   100   0.5     0.3008     0.0587     0.3298     0.1427     0.3236     0.1393
            1.0     0.3088     0.0586     0.3369     0.1364     0.3062     0.1316
      200   0.5     0.1742     0.0256     0.1783     0.0802     0.1892     0.0701
            1.0     0.2899     0.0328     0.2830     0.0824     0.3043     0.0721
      500   0.5     0.0924     0.0065     0.1124     0.0421     0.1004     0.0345
            1.0     0.2299     0.0078     0.2305     0.0458     0.2314     0.0362
     1000   0.5     0.0616     0.0033     0.0637     0.0270     0.0646     0.0224
            1.0     0.1460     0.0034     0.1433     0.0275     0.1429     0.0219

Monte Carlo average squared errors (ASE) based on 100 replications.

We consider first the accuracy of the estimation, measured in terms of mean average squared error. Then, to see that the SPBK estimator $\hat m_\alpha^*(x_\alpha)$ is as efficient as the “oracle smoother” $\tilde m_\alpha^*(x_\alpha)$, we define the empirical relative efficiency of $\hat m_\alpha^*(x_\alpha)$ with respect to $\tilde m_\alpha^*(x_\alpha)$ as
$$\mathrm{eff}_\alpha = \Biggl[\frac{\sum_{i=1}^{n}\{\tilde m_\alpha^*(X_{i\alpha}) - m_\alpha(X_{i\alpha})\}^2}{\sum_{i=1}^{n}\{\hat m_\alpha^*(X_{i\alpha}) - m_\alpha(X_{i\alpha})\}^2}\Biggr]^{1/2}. \tag{6.1}$$
Theorem 2.1 indicates that $\mathrm{eff}_\alpha$ should be close to 1 for all $\alpha = 1, \ldots, d$. Figure 2 provides kernel density estimates of these empirical efficiencies so that the convergence can be observed.
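A hedged R sketch of this efficiency computation (the variable names are illustrative):

```r
## Empirical relative efficiency (6.1): m.oracle and m.spbk are the oracle and
## SPBK fits evaluated at the observed X_{i,alpha}; m.true are the true values.
rel_eff <- function(m.oracle, m.spbk, m.true) {
  sqrt(sum((m.oracle - m.true)^2) / sum((m.spbk - m.true)^2))
}
```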

EXAMPLE 6.1. A time series $\{Y_t\}_{t=-1999}^{n+3}$ is generated according to the nonlinear additive autoregression model with sine functions given in [2],
$$Y_t = 1.5\sin\Bigl(\frac{\pi}{2}Y_{t-2}\Bigr) - 1.0\sin\Bigl(\frac{\pi}{2}Y_{t-3}\Bigr) + \sigma_0\varepsilon_t, \qquad \sigma_0 = 0.5, 1.0,$$
where $\{\varepsilon_t\}_{t=-1996}^{n+3}$ are i.i.d. standard normal errors. Let $\mathbf{X}_t^T = \{Y_{t-1}, Y_{t-2}, Y_{t-3}\}$. Theorem 3 on page 91 of [4] establishes that $\{Y_t, \mathbf{X}_t^T\}_{t=-1996}^{n+3}$ is geometrically ergodic. The first 2000 observations are discarded to make $\{Y_t\}_{t=1}^{n+3}$ behave like a


geometrically $\alpha$-mixing and strictly stationary time series. The multivariate data $\{Y_t, \mathbf{X}_t^T\}_{t=4}^{n+3}$ then satisfy assumptions (A1) to (A6), except that instead of being $[0,1]$, the range of $Y_{t-\alpha}$, $\alpha = 1, 2, 3$, needs to be recalibrated. Since we have no exact knowledge of the distribution of the $Y_t$, we generated many realizations of size 50,000, from which we found that more than 95% of the observations fall in $[-2.58, 2.58]$ ($[-3.14, 3.14]$) with $\sigma_0 = 0.5$ ($\sigma_0 = 1$). We will estimate the functions $\{m_\alpha(x_\alpha)\}_{\alpha=1}^3$ for $x_\alpha \in [-2.58, 2.58]$ ($[-3.14, 3.14]$) with $\sigma_0 = 0.5$ ($\sigma_0 = 1.0$), where
$$m_1(x_1) \equiv 0, \qquad m_2(x_2) \equiv 1.5\sin\Bigl(\frac{\pi}{2}x_2\Bigr) - E\Bigl[1.5\sin\Bigl(\frac{\pi}{2}Y_t\Bigr)\Bigr], \qquad m_3(x_3) \equiv -1.0\sin\Bigl(\frac{\pi}{2}x_3\Bigr) - E\Bigl[-1.0\sin\Bigl(\frac{\pi}{2}Y_t\Bigr)\Bigr]. \tag{6.2}$$
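A minimal R sketch of this data-generating process (hedged: the burn-in handling and initial values are illustrative choices):

```r
## Example 6.1 series: Y_t = 1.5 sin(pi/2 Y_{t-2}) - sin(pi/2 Y_{t-3}) + sigma0*eps_t,
## with 2000 burn-in observations discarded to approximate stationarity.
gen_example61 <- function(n, sigma0 = 0.5, burn = 2000) {
  total <- n + 3 + burn
  Y <- numeric(total)
  eps <- rnorm(total)
  for (t in 4:total) {
    Y[t] <- 1.5 * sin(pi / 2 * Y[t - 2]) - 1.0 * sin(pi / 2 * Y[t - 3]) + sigma0 * eps[t]
  }
  Y <- Y[(burn + 1):total]                        # keep the last n + 3 values
  list(Y = Y[4:(n + 3)],                          # responses Y_t, t = 4, ..., n + 3
       X = cbind(Y[3:(n + 2)], Y[2:(n + 1)], Y[1:n]))  # lags (Y_{t-1}, Y_{t-2}, Y_{t-3})
}
```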

We choose the sample size $n$ to be 100, 200, 500 and 1000. Table 1 lists the average squared errors (ASE) of the SPBK estimators and of the constant spline pilot estimators from 100 Monte Carlo replications. As expected, increases in sample size reduce the ASE of both estimators across all combinations of $c$ values and noise levels. Table 1 also shows that our SPBK estimators improve upon the spline pilot estimators immensely regardless of noise level and sample size, which implies that our second, Nadaraya–Watson smoothing step is not redundant.

To have some impression of the actual function estimates, at noise level $\sigma_0 = 0.5$ with sample sizes $n = 200, 500$, we have plotted the oracle estimator (thin dotted lines), the SPBK estimator $\hat m_\alpha^*$ (thin solid lines) and their 95% pointwise confidence intervals (upper and lower dashed curves) for the true functions $m_\alpha$ (thick solid lines) in Figure 1. The visual impression of the SPBK estimators is rather satisfactory and their performance improves with increasing $n$.

To see the convergence, Figure 2(a) and (b) plot the kernel density estimates of the 100 empirical efficiencies for $\alpha = 2, 3$ and sample sizes $n = 100, 200, 500$ and 1000 at the noise level $\sigma_0 = 0.5$. The vertical line at efficiency $= 1$ is the reference line for the comparison of $\hat m_\alpha^*(x_\alpha)$ and $\tilde m_\alpha^*(x_\alpha)$. One can clearly see that the center of the density plots moves toward the reference line 1.0 with narrower spread as the sample size $n$ increases, which confirms the result of Theorem 2.1.

EXAMPLE 6.2. Consider the nonlinear additive heteroscedastic model
$$Y_t = \sum_{\alpha=1}^{d}\sin\Bigl(\frac{\pi}{2.5}X_{t-\alpha}\Bigr) + \sigma(\mathbf{X})\varepsilon_t, \qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0,1),$$
in which $\mathbf{X}_t^T = \{X_{t-1}, \ldots, X_{t-d}\}$ is a sequence of i.i.d. standard normal random variables truncated to $[-2.5, 2.5]$ and
$$\sigma(\mathbf{X}) = \sigma_0\frac{\sqrt{d}}{2}\cdot\frac{5 - \exp\bigl(\sum_{\alpha=1}^{d}|X_{t-\alpha}|/d\bigr)}{5 + \exp\bigl(\sum_{\alpha=1}^{d}|X_{t-\alpha}|/d\bigr)}, \qquad \sigma_0 = 0.1.$$


FIG. 1. Plots of the oracle estimator (dotted blue curve), the SPBK estimator (solid red curve) and the 95% pointwise confidence intervals constructed by (2.13) (upper and lower dashed red curves) of the function components $m_\alpha(x_\alpha)$ in (6.2), $\alpha = 1, 2, 3$ (solid green curve).


FIG. 2. Kernel density plots of the 100 empirical efficiencies of $\hat m_\alpha^*(x_\alpha)$ to $\tilde m_\alpha^*(x_\alpha)$, computed according to (6.1): (a) Example 6.1 ($\alpha = 2$, $d = 3$); (b) Example 6.1 ($\alpha = 3$, $d = 3$); (c) Example 6.2 ($\alpha = 1$, $d = 30$); (d) Example 6.2 ($\alpha = 2$, $d = 30$).

By this choice of $\sigma(\mathbf{X})$, we ensure that our design is heteroscedastic, and the variance is roughly proportional to the dimension $d$, which is intended to mimic the case when independent copies of the same kind of univariate regression problem are simply added together.

For $d = 30$, we have run 100 replications for sample sizes $n = 500, 1000, 1500$ and 2000. The kernel density estimates of the 100 empirical efficiencies for $\alpha = 1, 2$ are graphically represented in (c) and (d) of Figure 2, respectively. Again one sees that with increasing $n$, the efficiency distribution converges to 1.


TABLE 2
The computing time of Example 6.1 (in seconds)

Method    n = 100    n = 200    n = 400    n = 1000
MIE          10         76        628        10728
SPBK        0.7        0.9        1.2          4.5

Lastly, we report the computing time for Example 6.1 from 100 replications on an ordinary PC with an Intel Pentium IV 1.86 GHz processor and 1.0 GB RAM. The average time taken by XploRe to generate one sample of size $n$ and compute the SPBK estimator and the marginal integration estimator (MIE) is reported in Table 2. The MIEs have been obtained by directly calling “intest” in XploRe. As expected, the computing time for MIE is extremely sensitive to the sample size, due to the fact that it requires $n^2$ least squares fits in two steps. In contrast, at least for large samples, our proposed SPBK estimator is thousands of times faster than MIE. Thus SPBK estimation is feasible and appealing for dealing with massive data sets.

APPENDIX

Throughout this section, $a_n \gg b_n$ means $\lim_{n\to\infty}b_n/a_n = 0$, and $a_n \sim b_n$ means $\lim_{n\to\infty}b_n/a_n = c$, where $c$ is some constant.

A.1. Preliminaries. We first give the Bernstein inequality for a geometrically $\alpha$-mixing sequence, which plays an important role throughout our proofs.

LEMMA A.1 (Theorem 1.4, page 31 of [1]). Let $\{\xi_t, t \in \mathbb{Z}\}$ be a zero-mean real-valued $\alpha$-mixing process, $S_n = \sum_{i=1}^{n}\xi_i$. Suppose that there exists $c > 0$ such that for $i = 1, \ldots, n$, $k = 3, 4, \ldots$, $E|\xi_i|^k \le c^{k-2}k!\,E\xi_i^2 < +\infty$; then for each $n > 1$, integer $q \in [1, n/2]$, each $\varepsilon > 0$ and $k \ge 3$,
$$P(|S_n| \ge n\varepsilon) \le a_1\exp\biggl(-\frac{q\varepsilon^2}{25m_2^2 + 5c\varepsilon}\biggr) + a_2(k)\,\alpha\biggl(\biggl[\frac{n}{q+1}\biggr]\biggr)^{2k/(2k+1)},$$
where $\alpha(\cdot)$ is the $\alpha$-mixing coefficient defined in (2.10) and
$$a_1 = 2\frac{n}{q} + 2\biggl(1 + \frac{\varepsilon^2}{25m_2^2 + 5c\varepsilon}\biggr), \qquad a_2(k) = 11n\biggl(1 + \frac{5m_k^{2k/(2k+1)}}{\varepsilon}\biggr),$$
with $m_r = \max_{1\le i\le n}\|\xi_i\|_r$, $r \ge 2$.

LEMMA A.2. Under assumptions (A4) and (A6), one has:


(i) There exist constants $C_0(f)$ and $C_1(f)$, depending on the marginal densities $f_\alpha(x_\alpha)$, $\alpha = 1, 2$, such that $C_0(f)H \le \|b_{J,\alpha}\|_2^2 \le C_1(f)H$, where $b_{J,\alpha}$ is given in (2.2).

(ii) For any $\alpha = 1, 2$, $|J' - J| \le 1$, $E\{B_{J,\alpha}(X_{i\alpha})B_{J',\alpha}(X_{i\alpha})\} \sim 1$; in addition,
$$E|B_{J,\alpha}(X_{i\alpha})B_{J',\alpha}(X_{i\alpha})|^k \sim H^{1-k}, \qquad k \ge 1,$$
where $B_{J,\alpha}$ and $B_{J',\alpha}$ are defined in (2.3).

For the proof of the above lemma, see Lemma A.2 in [26].

LEMMA A.3. Under assumptions (A4)–(A6), for $\mu_{\omega_J}(x_1)$ given in (5.5),
$$\sup_{x_1\in[0,1]}\sup_{1\le J\le N}|\mu_{\omega_J}(x_1)| = O(H^{1/2}).$$

PROOF. Denote the theoretical norm of $I_{J,\alpha}$ in (2.1) for $\alpha = 1, 2$, $J = 1, \ldots, N+1$,
$$c_{J,\alpha} = \|I_{J,\alpha}\|_2^2 = \int I_{J,\alpha}^2(x_\alpha)f_\alpha(x_\alpha)\,dx_\alpha. \tag{A.1}$$
By definition, $|\mu_{\omega_J}(x_1)| = |E\{K_h(X_{l1}-x_1)B_{J,2}(X_{l2})\}|$ is bounded by
$$\int\!\!\int K_h(u_1-x_1)|B_{J,2}(u_2)|f(u_1,u_2)\,du_1\,du_2 = (\|b_{J,2}\|_2)^{-1}\Biggl\{\int\!\!\int K(v_1)I_{J+1,2}(u_2)f(hv_1+x_1, u_2)\,dv_1\,du_2 + \biggl(\frac{c_{J+1,2}}{c_{J,2}}\biggr)^{1/2}\int\!\!\int K(v_1)I_{J,2}(u_2)f(hv_1+x_1, u_2)\,dv_1\,du_2\Biggr\}.$$
The boundedness of the joint density $f$ and the Lipschitz continuity of the kernel $K$ then imply that
$$\sup_{x_1\in[0,1]}\sup_{1\le J\le N}\int\!\!\int K(v_1)I_{J,2}(u_2)f(hv_1+x_1, u_2)\,dv_1\,du_2 \le C_KC_fH.$$
The proof of the lemma is then completed by (i) of Lemma A.2. $\square$

LEMMA A.4. Under assumptions (A2) and (A4)–(A6), one has
$$\sup_{x_1\in[0,1]}\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{l=1}^{n}\{\omega_J(\mathbf{X}_l, x_1) - \mu_{\omega_J}(x_1)\}\Biggr| = O_p\bigl(\log n/\sqrt{nh}\bigr), \tag{A.2}$$
$$\sup_{x_1\in[0,1]}\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{l=1}^{n}\omega_J(\mathbf{X}_l, x_1)\Biggr| = O_p(H^{1/2}), \tag{A.3}$$
where $\omega_J(\mathbf{X}_l, x_1)$ and $\mu_{\omega_J}(x_1)$ are given in (5.5).


PROOF. For simplicity, denote $\omega_J^*(\mathbf{X}_l, x_1) = \omega_J(\mathbf{X}_l, x_1) - \mu_{\omega_J}(x_1)$. Then
$$E\{\omega_J^*(\mathbf{X}_l, x_1)\}^2 = E\omega_J^2(\mathbf{X}_l, x_1) - \mu_{\omega_J}^2(x_1),$$
while $E\omega_J^2(\mathbf{X}_l, x_1)$ is equal to
$$h^{-1}\|b_{J,2}\|_2^{-2}\int\!\!\int K^2(v_1)\Bigl\{I_{J+1,2}(u_2) + \frac{c_{J+1,2}}{c_{J,2}}I_{J,2}(u_2)\Bigr\}f(hv_1+x_1, u_2)\,dv_1\,du_2,$$
where $c_{J,\alpha}$ is given in (A.1). So $E\omega_J^2(\mathbf{X}_l, x_1) \sim h^{-1}$ and $E\omega_J^2(\mathbf{X}_l, x_1) \gg \mu_{\omega_J}^2(x_1)$. Hence for $n$ sufficiently large, $E\{\omega_J^*(\mathbf{X}_l, x_1)\}^2 = E\omega_J^2(\mathbf{X}_l, x_1) - \mu_{\omega_J}^2(x_1) \ge c^*h^{-1}$, for some positive constant $c^*$. When $r \ge 3$, the $r$th moment $E|\omega_J(\mathbf{X}_l, x_1)|^r$ is
$$\frac{1}{\|b_{J,2}\|_2^r}\int\!\!\int K_h^r(u_1-x_1)\Bigl\{I_{J+1,2}(u_2) + \Bigl(\frac{c_{J+1,2}}{c_{J,2}}\Bigr)^rI_{J,2}(u_2)\Bigr\}f(u_1,u_2)\,du_1\,du_2.$$
It is clear that $E|\omega_J(\mathbf{X}_l, x_1)|^r \sim h^{1-r}H^{1-r/2}$. According to Lemma A.3, one has $|E\omega_J(\mathbf{X}_l, x_1)|^r \le CH^{r/2}$, thus $E|\omega_J(\mathbf{X}_l, x_1)|^r \gg |\mu_{\omega_J}(x_1)|^r$. In addition,
$$E|\omega_J^*(\mathbf{X}_l, x_1)|^r \le \Bigl\{\frac{c}{hH^{1/2}}\Bigr\}^{r-2}r!\,E|\omega_J^*(\mathbf{X}_l, x_1)|^2,$$
so there exists $c_* = ch^{-1}H^{-1/2}$ such that $E|\omega_J^*(\mathbf{X}_l, x_1)|^r \le c_*^{r-2}r!\,E|\omega_J^*(\mathbf{X}_l, x_1)|^2$, which implies that $\{\omega_J^*(\mathbf{X}_l, x_1)\}_{l=1}^n$ satisfies Cramér's condition. By Bernstein's inequality, for $r = 3$,
$$P\Biggl\{\Biggl|\frac{1}{n}\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_1)\Biggr| \ge \rho_n\Biggr\} \le a_1\exp\biggl(-\frac{q\rho_n^2}{25m_2^2 + 5c_*\rho_n}\biggr) + a_2(3)\,\alpha\biggl(\biggl[\frac{n}{q+1}\biggr]\biggr)^{6/7},$$
with $m_2^2 \sim h^{-1}$, $m_3 = \max_{1\le i\le n}\|\omega_J^*(\mathbf{X}_l, x_1)\|_3 \le \{C_0(2h^{-1})^2\}^{1/3}$ and
$$\rho_n = \rho\frac{\log n}{\sqrt{nh}}, \qquad a_1 = 2\frac{n}{q} + 2\biggl(1 + \frac{\rho_n^2}{25m_2^2 + 5c_*\rho_n}\biggr), \qquad a_2(3) = 11n\biggl(1 + \frac{5m_3^{6/7}}{\rho_n}\biggr).$$
Observe that $5c_*\rho_n = o(1)$; by taking $q$ such that $[\frac{n}{q+1}] \sim c_0\log n$, $q \sim c_1n/\log n$ for some constants $c_0$, $c_1$, one has $a_1 = O(n/q) = O(\log n)$, $a_2(3) = o(n^2)$. Assumption (A2) yields that $\alpha([\frac{n}{q+1}])^{6/7} \le Cn^{-6\lambda_0c_0/7}$. Thus, for $n$ large enough,
$$P\Biggl\{\frac{1}{n}\Biggl|\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_1)\Biggr| > \rho\frac{\log n}{\sqrt{nh}}\Biggr\} \le cn^{-c_2\rho^2}\log n + Cn^{2-6\lambda_0c_0/7}. \tag{A.4}$$

We divide the interval $[0,1]$ into $M_n \sim n^6$ equally spaced intervals with disjoint endpoints $0 = x_{1,0} < x_{1,1} < \cdots < x_{1,M_n} = 1$. Employing the discretization method,
$$\sup_{x_1\in[0,1]}\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_1)\Biggr| \le \sup_{0\le k\le M_n}\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_{1,k})\Biggr| + \sup_{1\le k\le M_n}\sup_{1\le J\le N}\sup_{x_1\in[x_{1,k-1},x_{1,k}]}\Biggl|n^{-1}\sum_{l=1}^{n}\{\omega_J^*(\mathbf{X}_l, x_1) - \omega_J^*(\mathbf{X}_l, x_{1,k})\}\Biggr|. \tag{A.5}$$
By (A.4), there exists a large enough value $\rho > 0$ such that for any $1 \le k \le M_n$,
$$P\Biggl\{\frac{1}{n}\Biggl|\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_{1,k})\Biggr| > \rho(nh)^{-1/2}\log n\Biggr\} \le n^{-10}, \qquad 1 \le J \le N,$$
which implies that
$$\sum_{n=1}^{\infty}P\Biggl\{\sup_{0\le k\le M_n}\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_{1,k})\Biggr| \ge \rho\frac{\log n}{\sqrt{nh}}\Biggr\} \le \sum_{n=1}^{\infty}\sum_{k=1}^{M_n}\sum_{J=1}^{N}P\Biggl\{\Biggl|n^{-1}\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_{1,k})\Biggr| \ge \rho\frac{\log n}{\sqrt{nh}}\Biggr\} \le \sum_{n=1}^{\infty}NM_nn^{-10} < \infty.$$
Thus, the Borel–Cantelli lemma entails that
$$\sup_{0\le k\le M_n}\sup_{1\le J\le N}\Biggl|n^{-1}\sum_{l=1}^{n}\omega_J^*(\mathbf{X}_l, x_{1,k})\Biggr| = O_p\bigl(\log n/\sqrt{nh}\bigr). \tag{A.6}$$
Employing the Lipschitz continuity of the kernel $K$, for $x_1 \in [x_{1,k-1}, x_{1,k}]$,
$$\sup_{1\le k\le M_n}|K_h(X_{l1}-x_1) - K_h(X_{l1}-x_{1,k})| \le C_KM_n^{-1}h^{-2}.$$
According to the fact that $M_n \sim n^6$, one has
$$\sup_{1\le k\le M_n}\sup_{1\le J\le N}\sup_{x_1\in[x_{1,k-1},x_{1,k}]}\Biggl|n^{-1}\sum_{l=1}^{n}\{\omega_J^*(\mathbf{X}_l, x_1) - \omega_J^*(\mathbf{X}_l, x_{1,k})\}\Biggr| \le C_KM_n^{-1}h^{-2}\sup_{x_2\in[0,1]}\sup_{1\le J\le N}|B_{J,2}(x_2)| = O(M_n^{-1}h^{-2}H^{-1/2}) = o(n^{-1}).$$

Thus (A.2) follows instantly from (A.5) and (A.6). As a result of Lemma A.3 and (A.2), (A.3) holds. $\square$

LEMMA A.5. Under assumptions (A4) and (A6), there exist constants $C_0 > c_0 > 0$ such that for any $\mathbf{a} = (a_0, a_{1,1}, \ldots, a_{N,1}, a_{1,2}, \ldots, a_{N,2})$,
$$c_0\biggl(a_0^2 + \sum_{J,\alpha}a_{J,\alpha}^2\biggr) \le \Biggl\|a_0 + \sum_{J,\alpha}a_{J,\alpha}B_{J,\alpha}\Biggr\|_2^2 \le C_0\biggl(a_0^2 + \sum_{J,\alpha}a_{J,\alpha}^2\biggr). \tag{A.7}$$

For the proof of the above lemma, see Lemma A.4 in [26]. The next lemma provides the size of $\hat{\mathbf{a}}^T\hat{\mathbf{a}}$, where $\hat{\mathbf{a}}$ is the least squares solution defined by (3.6).

LEMMA A.6. Under assumptions (A2)–(A6), $\hat{\mathbf{a}}$ satisfies
$$\hat{\mathbf{a}}^T\hat{\mathbf{a}} = \hat a_0^2 + \sum_{J=1}^{N}\sum_{\alpha=1}^{2}\hat a_{J,\alpha}^2 = O_p\{N(\log n)^2/n\}. \tag{A.8}$$

PROOF. According to (3.7) and (3.8), $\hat{\mathbf{a}}^T\mathbf{B}^T\mathbf{B}\hat{\mathbf{a}} = \hat{\mathbf{a}}^T(\mathbf{B}^T\mathbf{E})$. Thus
$$\|\mathbf{B}\hat{\mathbf{a}}\|_{2,n}^2 = \hat{\mathbf{a}}^T\bigl(n^{-1}\mathbf{B}^T\mathbf{B}\bigr)\hat{\mathbf{a}} = \hat{\mathbf{a}}^T(n^{-1}\mathbf{B}^T\mathbf{E}). \tag{A.9}$$
By (A.15), $\|\mathbf{B}\hat{\mathbf{a}}\|_{2,n}^2$ is bounded below in probability by $(1-A_n)\|\mathbf{B}\hat{\mathbf{a}}\|_2^2$. According to (A.7), one has
$$\|\mathbf{B}\hat{\mathbf{a}}\|_2^2 = \Biggl\|\hat a_0 + \sum_{J=1}^{N}\sum_{\alpha=1}^{2}\hat a_{J,\alpha}B_{J,\alpha}\Biggr\|_2^2 \ge c_0\biggl(\hat a_0^2 + \sum_{J,\alpha}\hat a_{J,\alpha}^2\biggr). \tag{A.10}$$
Meanwhile, one can show that $\hat{\mathbf{a}}^T(n^{-1}\mathbf{B}^T\mathbf{E})$ is bounded above by
$$\sqrt{\hat a_0^2 + \sum_{J,\alpha}\hat a_{J,\alpha}^2}\;\Biggl[\Biggl\{\frac{1}{n}\sum_{i=1}^{n}\sigma(\mathbf{X}_i)\varepsilon_i\Biggr\}^2 + \sum_{J,\alpha}\Biggl\{\frac{1}{n}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)\varepsilon_i\Biggr\}^2\Biggr]^{1/2}. \tag{A.11}$$

Combining (A.9), (A.10) and (A.11), the squared norm $\hat{\mathbf{a}}^T\hat{\mathbf{a}}$ is bounded by
$$c_0^{-2}(1-A_n)^{-2}\Biggl[\Biggl\{\frac{1}{n}\sum_{i=1}^{n}\sigma(\mathbf{X}_i)\varepsilon_i\Biggr\}^2 + \sum_{J,\alpha}\Biggl\{\frac{1}{n}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)\varepsilon_i\Biggr\}^2\Biggr].$$

Using the same truncation of ε as in Lemma A.11, the Bernstein inequality entails


that
$$\Biggl|n^{-1}\sum_{i=1}^{n}\sigma(\mathbf{X}_i)\varepsilon_i\Biggr| + \max_{1\le J\le N,\,\alpha=1,2}\Biggl|n^{-1}\sum_{i=1}^{n}B_{J,\alpha}(X_{i\alpha})\sigma(\mathbf{X}_i)\varepsilon_i\Biggr| = O_p\bigl(\log n/\sqrt{n}\bigr).$$
Therefore (A.8) holds since $A_n$ is of order $o_p(1)$. $\square$

A.2. Empirical approximation of the theoretical inner product.

LEMMA A.7. Under assumptions (A2), (A4) and (A6), one has
$$\sup_{J,\alpha}|\langle 1, B_{J,\alpha}\rangle_{2,n} - \langle 1, B_{J,\alpha}\rangle_2| = O_p(n^{-1/2}\log n), \tag{A.12}$$
$$\sup_{J,J',\alpha}|\langle B_{J,\alpha}, B_{J',\alpha}\rangle_{2,n} - \langle B_{J,\alpha}, B_{J',\alpha}\rangle_2| = O_p(n^{-1/2}H^{-1/2}\log n), \tag{A.13}$$
$$\sup_{1\le J,J'\le N,\,\alpha\ne\alpha'}|\langle B_{J,\alpha}, B_{J',\alpha'}\rangle_{2,n} - \langle B_{J,\alpha}, B_{J',\alpha'}\rangle_2| = O_p(n^{-1/2}\log n). \tag{A.14}$$

For the proof of the above lemma, see Lemma A.7 in [26].

LEMMA A.8. Under assumptions (A2), (A4) and (A6), one has
$$A_n = \sup_{g_1,g_2\in G^{(-1)}}\frac{|\langle g_1, g_2\rangle_{2,n} - \langle g_1, g_2\rangle_2|}{\|g_1\|_2\|g_2\|_2} = O_p\biggl(\frac{\log n}{n^{1/2}H^{1/2}}\biggr) = o_p(1). \tag{A.15}$$

PROOF. For every $g_1, g_2 \in G^{(-1)}$, one can write
$$g_1(X_1, X_2) = a_0 + \sum_{J=1}^{N}\sum_{\alpha=1}^{2}a_{J,\alpha}B_{J,\alpha}(X_\alpha), \qquad g_2(X_1, X_2) = a_0' + \sum_{J'=1}^{N}\sum_{\alpha'=1}^{2}a_{J',\alpha'}'B_{J',\alpha'}(X_{\alpha'}),$$
where for any $J, J' = 1, \ldots, N$, $\alpha, \alpha' = 1, 2$, the $a_{J,\alpha}$ and $a_{J',\alpha'}'$ are real constants. Then
$$|\langle g_1, g_2\rangle_{2,n} - \langle g_1, g_2\rangle_2| \le \Biggl|\sum_{J,\alpha}\langle a_0', a_{J,\alpha}B_{J,\alpha}\rangle_{2,n}\Biggr| + \Biggl|\sum_{J',\alpha'}\langle a_0, a_{J',\alpha'}'B_{J',\alpha'}\rangle_{2,n}\Biggr| + \sum_{J,J',\alpha,\alpha'}|a_{J,\alpha}||a_{J',\alpha'}'||\langle B_{J,\alpha}, B_{J',\alpha'}\rangle_{2,n} - \langle B_{J,\alpha}, B_{J',\alpha'}\rangle_2| = L_1 + L_2 + L_3.$$


The equivalence of norms given in (A.7) and (A.12) leads to
$$L_1 \le A_{n,1}\cdot|a_0'|\cdot\sum_{J,\alpha}|a_{J,\alpha}| \le C_0A_{n,1}\biggl(a_0'^2 + \sum_{J,\alpha}a_{J,\alpha}'^2\biggr)^{1/2}\biggl(\sum_{J,\alpha}a_{J,\alpha}^2\biggr)^{1/2}N^{1/2} \le C_{A,1}A_{n,1}\|g_1\|_2\|g_2\|_2H^{-1/2} = O_p(n^{-1/2}H^{-1/2}\log n)\|g_1\|_2\|g_2\|_2.$$
Similarly, $L_2 = O_p(n^{-1/2}H^{-1/2}\log n)\|g_1\|_2\|g_2\|_2$. By the Cauchy–Schwarz inequality,
$$L_3 \le \sum_{J,J',\alpha,\alpha'}|a_{J,\alpha}||a_{J',\alpha'}'|\max(A_{n,2}, A_{n,3}) \le C_{A,2}\max(A_{n,2}, A_{n,3})\|g_1\|_2\|g_2\|_2 = O_p(n^{-1/2}H^{-1/2}\log n)\|g_1\|_2\|g_2\|_2.$$
Therefore, statement (A.15) is established. $\square$

A.3. Proof of Lemma 5.2. Denote by $\mathbf{V}$ the theoretical inner product matrix of the B spline basis $\{1, B_{J,\alpha}(x_\alpha), J = 1, \ldots, N, \alpha = 1, 2\}$, that is,
$$\mathbf{V} = \begin{pmatrix} 1 & \mathbf{0}_{2N}^T \\ \mathbf{0}_{2N} & \langle B_{J,\alpha}, B_{J',\alpha'}\rangle_2 \end{pmatrix}_{\substack{1\le\alpha,\alpha'\le2\\ 1\le J,J'\le N}}, \tag{A.16}$$
where $\mathbf{0}_p = \{0, \ldots, 0\}^T$. Let $\mathbf{S}$ be the inverse matrix of $\mathbf{V}$, that is,
$$\mathbf{S} = \mathbf{V}^{-1} = \begin{pmatrix} 1 & \mathbf{0}_N^T & \mathbf{0}_N^T \\ \mathbf{0}_N & \mathbf{V}_{11} & \mathbf{V}_{12} \\ \mathbf{0}_N & \mathbf{V}_{21} & \mathbf{V}_{22} \end{pmatrix}^{-1} = \begin{pmatrix} 1 & \mathbf{0}_N^T & \mathbf{0}_N^T \\ \mathbf{0}_N & \mathbf{S}_{11} & \mathbf{S}_{12} \\ \mathbf{0}_N & \mathbf{S}_{21} & \mathbf{S}_{22} \end{pmatrix}. \tag{A.17}$$

LEMMA A.9. Under assumptions (A4) and (A6), for $\mathbf{V}$, $\mathbf{S}$ defined in (A.16), (A.17), there exist constants $C_V > c_V > 0$ and $C_S > c_S > 0$ such that
$$c_VI_{2N+1} \le \mathbf{V} \le C_VI_{2N+1}, \qquad c_SI_{2N+1} \le \mathbf{S} \le C_SI_{2N+1}.$$

For the proof of the above lemma, see Lemma A.9 in [26]. Next we denote
$$\mathbf{V}^* = \begin{pmatrix} 0 & \mathbf{0}_{2N}^T \\ \mathbf{0}_{2N} & \langle B_{J,\alpha}, B_{J',\alpha'}\rangle_{2,n} - \langle B_{J,\alpha}, B_{J',\alpha'}\rangle_2 \end{pmatrix}_{\substack{1\le\alpha,\alpha'\le2\\ 1\le J,J'\le N}}.$$


Then $\hat{\mathbf{a}}$ in (3.8) can be rewritten as
$$\hat{\mathbf{a}} = (\mathbf{B}^T\mathbf{B})^{-1}\mathbf{B}^T\mathbf{E} = \biggl(\frac{1}{n}\mathbf{B}^T\mathbf{B}\biggr)^{-1}\biggl(\frac{1}{n}\mathbf{B}^T\mathbf{E}\biggr) = (\mathbf{V} + \mathbf{V}^*)^{-1}\biggl(\frac{1}{n}\mathbf{B}^T\mathbf{E}\biggr). \tag{A.18}$$
Now define $\bar{\mathbf{a}} = \{\bar a_0, \bar a_{1,1}, \ldots, \bar a_{N,1}, \bar a_{1,2}, \ldots, \bar a_{N,2}\}^T$ as
$$\bar{\mathbf{a}} = \mathbf{V}^{-1}(n^{-1}\mathbf{B}^T\mathbf{E}) = \mathbf{S}(n^{-1}\mathbf{B}^T\mathbf{E}), \tag{A.19}$$
and define a theoretical version of $\Psi_v^{(2)}(x_1)$ in (5.6) as
$$\bar\Psi_v^{(2)}(x_1) = n^{-1}\sum_{i=1}^{n}\sum_{J=1}^{N}\bar a_{J,2}\,\omega_J(\mathbf{X}_i, x_1). \tag{A.20}$$

LEMMA A.10. Under assumptions (A2) to (A6),
$$\sup_{x_1\in[0,1]}\bigl|\Psi_v^{(2)}(x_1) - \bar\Psi_v^{(2)}(x_1)\bigr| = O_p\{(\log n)^2/(nH)\}.$$

PROOF. According to (A.18) and (A.19), one has $\mathbf{V}\bar{\mathbf{a}} = (\mathbf{V} + \mathbf{V}^*)\hat{\mathbf{a}}$, which implies that $\mathbf{V}^*\hat{\mathbf{a}} = \mathbf{V}(\bar{\mathbf{a}} - \hat{\mathbf{a}})$. Using (A.13) and (A.14), one obtains that
$$\|\mathbf{V}(\bar{\mathbf{a}} - \hat{\mathbf{a}})\| = \|\mathbf{V}^*\hat{\mathbf{a}}\| \le O_p(n^{-1/2}H^{-1}\log n)\|\hat{\mathbf{a}}\|.$$
According to Lemma A.6, $\|\hat{\mathbf{a}}\| = O_p(n^{-1/2}N^{1/2}\log n)$, so one has
$$\|\mathbf{V}(\bar{\mathbf{a}} - \hat{\mathbf{a}})\| \le O_p\{(\log n)^2n^{-1}N^{3/2}\}.$$
By Lemma A.9, $\|\bar{\mathbf{a}} - \hat{\mathbf{a}}\| = O_p\{(\log n)^2n^{-1}N^{3/2}\}$. Lemma A.6 then implies
$$\|\bar{\mathbf{a}}\| \le \|\bar{\mathbf{a}} - \hat{\mathbf{a}}\| + \|\hat{\mathbf{a}}\| = O_p\bigl(\log n\sqrt{N/n}\bigr). \tag{A.21}$$
Additionally, $|\Psi_v^{(2)}(x_1) - \bar\Psi_v^{(2)}(x_1)| = |\sum_{J=1}^{N}(\hat a_{J,2} - \bar a_{J,2})\frac{1}{n}\sum_{l=1}^{n}\omega_J(\mathbf{X}_l, x_1)|$. So
$$\sup_{x_1\in[0,1]}\bigl|\Psi_v^{(2)}(x_1) - \bar\Psi_v^{(2)}(x_1)\bigr| \le \sqrt{N}\,O_p\biggl\{\frac{(\log n)^2}{nH}\biggr\}O_p(H^{1/2}) = O_p\biggl\{\frac{(\log n)^2}{nH}\biggr\}.$$
Therefore the lemma follows. $\square$

LEMMA A.11. Under assumptions (A2)–(A6), for $\bar\Psi_v^{(2)}(x_1)$ as defined in (A.20), one has
$$\sup_{x_1\in[0,1]}\bigl|\bar\Psi_v^{(2)}(x_1)\bigr| = \sup_{x_1\in[0,1]}\Biggl|n^{-1}\sum_{i=1}^{n}K_h(X_{i1}-x_1)\sum_{J=1}^{N}\bar a_{J,2}B_{J,2}(X_{i2})\Biggr| = O_p(H).$$


PROOF. Note that
$$\bigl|\bar\Psi_v^{(2)}(x_1)\bigr| \le \Biggl|\sum_{J=1}^{N}\bar a_{J,2}\mu_{\omega_J}(x_1)\Biggr| + \Biggl|\sum_{J=1}^{N}\bar a_{J,2}\,n^{-1}\sum_{i=1}^{n}\{\omega_J(\mathbf{X}_i, x_1) - \mu_{\omega_J}(x_1)\}\Biggr| = Q_1(x_1) + Q_2(x_1). \tag{A.22}$$
By the Cauchy–Schwarz inequality, (A.21), Lemma A.4 and assumptions (A5), (A6),
$$\sup_{x_1\in[0,1]}Q_2(x_1) = O_p\bigl(\log n\sqrt{N/n}\bigr)\sqrt{N}\,O_p\biggl(\frac{\log n}{\sqrt{nh}}\biggr) = O_p\biggl\{\frac{(\log n)^3}{\sqrt{n}}\biggr\}. \tag{A.23}$$
Using the discretization idea again as in the proof of Lemma A.4, one has
$$\sup_{x_1\in[0,1]}Q_1(x_1) \le \max_{1\le k\le M_n}\Biggl|\sum_{J=1}^{N}\bar a_{J,2}\mu_{\omega_J}(x_{1,k})\Biggr| + \max_{1\le k\le M_n}\sup_{x_1\in[x_{1,k-1},x_{1,k}]}\Biggl|\sum_{J=1}^{N}\bar a_{J,2}\mu_{\omega_J}(x_1) - \sum_{J=1}^{N}\bar a_{J,2}\mu_{\omega_J}(x_{1,k})\Biggr| = T_1 + T_2, \tag{A.24}$$
where $M_n \sim n$. Define next
$$W_1 = \max_{1\le k\le M_n}\Biggl|n^{-1}\sum_{1\le i\le n}\sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)\varepsilon_i\Biggr|,$$
$$W_2 = \max_{1\le k\le M_n}\Biggl|n^{-1}\sum_{1\le i\le n}\sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})s_{J+N+1,J'+N+1}B_{J',2}(X_{i2})\sigma(\mathbf{X}_i)\varepsilon_i\Biggr|.$$
Then it is clear that $T_1 \le W_1 + W_2$. Next we will show that $W_1 = O_p(H)$. Let $D_n = n^{\theta_0}$ ($\frac{1}{2+\delta} < \theta_0 < \frac{2}{5}$), where $\delta$ is the same as in assumption (A3). Define
$$\varepsilon_{i,D}^- = \varepsilon_iI(|\varepsilon_i| \le D_n), \qquad \varepsilon_{i,D}^+ = \varepsilon_iI(|\varepsilon_i| > D_n), \qquad \varepsilon_{i,D}^* = \varepsilon_{i,D}^- - E(\varepsilon_{i,D}^-\mid\mathbf{X}_i),$$
$$U_{i,k} = \boldsymbol{\mu}_\omega(x_{1,k})^T\mathbf{S}_{21}\{B_{1,1}(X_{i1}), \ldots, B_{N,1}(X_{i1})\}^T\sigma(\mathbf{X}_i)\varepsilon_{i,D}^*.$$
Denote $W_1^D = \max_{1\le k\le M_n}|n^{-1}\sum_{i=1}^{n}U_{i,k}|$ as the truncated centered version of $W_1$. Next we show that $|W_1 - W_1^D| = O_p(H)$. Note that $|W_1 - W_1^D| \le \Delta_1 + \Delta_2$, where
$$\Delta_1 = \max_{1\le k\le M_n}\Biggl|\frac{1}{n}\sum_{i=1}^{n}\sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)E(\varepsilon_{i,D}^-\mid\mathbf{X}_i)\Biggr|,$$
$$\Delta_2 = \max_{1\le k\le M_n}\Biggl|\frac{1}{n}\sum_{i=1}^{n}\sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)\varepsilon_{i,D}^+\Biggr|.$$
Let $\boldsymbol{\mu}_\omega(x_{1,k}) = \{\mu_{\omega_1}(x_{1,k}), \ldots, \mu_{\omega_N}(x_{1,k})\}^T$; then
$$\Delta_1 = \max_{1\le k\le M_n}\Biggl|\boldsymbol{\mu}_\omega(x_{1,k})^T\mathbf{S}_{21}\Biggl\{n^{-1}\sum_{i=1}^{n}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)E(\varepsilon_{i,D}^-\mid\mathbf{X}_i)\Biggr\}_{J'=1}^{N}\Biggr| \le C_S\max_{1\le k\le M_n}\Biggl\{\sum_{J=1}^{N}\mu_{\omega_J}^2(x_{1,k})\sum_{J=1}^{N}\Biggl\{\frac{1}{n}\sum_{i=1}^{n}B_{J,1}(X_{i1})\sigma(\mathbf{X}_i)E(\varepsilon_{i,D}^-\mid\mathbf{X}_i)\Biggr\}^2\Biggr\}^{1/2}.$$
By assumption (A3), one has $|E(\varepsilon_{i,D}^-\mid\mathbf{X}_i)| = |E(\varepsilon_{i,D}^+\mid\mathbf{X}_i)| \le M_\delta D_n^{-(1+\delta)}$, and Lemma A.1 entails that $\sup_{J,\alpha}|\frac{1}{n}\sum_{i=1}^{n}B_{J,1}(X_{i1})\sigma(\mathbf{X}_i)| = O_p(\log n/\sqrt{n})$. Therefore
$$\Delta_1 \le M_\delta D_n^{-(1+\delta)}\max_{1\le k\le M_n}\Biggl[\sum_{J=1}^{N}\mu_{\omega_J}^2(x_{1,k})\sum_{J=1}^{N}\Biggl\{\frac{1}{n}\sum_{i=1}^{n}B_{J,1}(X_{i1})\sigma(\mathbf{X}_i)\Biggr\}^2\Biggr]^{1/2} = O_p\{ND_n^{-(1+\delta)}\log^2n/n\} = O_p(H),$$
where the last step follows from the choice of $D_n$. Meanwhile
$$\sum_{n=1}^{\infty}P(|\varepsilon_n| \ge D_n) \le \sum_{n=1}^{\infty}\frac{E|\varepsilon_n|^{2+\delta}}{D_n^{2+\delta}} = \sum_{n=1}^{\infty}\frac{E(E|\varepsilon_n|^{2+\delta}\mid\mathbf{X}_n)}{D_n^{2+\delta}} \le \sum_{n=1}^{\infty}\frac{M_\delta}{D_n^{2+\delta}} < \infty,$$
since $\delta > 1/2$. By the Borel–Cantelli lemma, one has, with probability 1,
$$n^{-1}\sum_{i=1}^{n}\sum_{1\le J,J'\le N}\mu_{\omega_J}(x_{1,k})s_{J+N+1,J'+1}B_{J',1}(X_{i1})\sigma(\mathbf{X}_i)\varepsilon_{i,D}^+ = 0$$
for large $n$. Therefore, one has $|W_1 - W_1^D| \le \Delta_1 + \Delta_2 = O_p(H)$. Next we will show that $W_1^D = O_p(H)$. Note that the variance of $U_{i,k}$ is
$$\boldsymbol{\mu}_\omega(x_{1,k})^T\mathbf{S}_{21}\operatorname{var}\bigl(\{B_{1,1}(X_{i1}), \ldots, B_{N,1}(X_{i1})\}^T\sigma(\mathbf{X}_i)\varepsilon_{i,D}^*\bigr)\mathbf{S}_{21}\boldsymbol{\mu}_\omega(x_{1,k}).$$
By assumption (A3), $c_\sigma^2\mathbf{V}_{11} \le \operatorname{var}(\{B_{1,1}(X_{i1}), \ldots, B_{N,1}(X_{i1})\}^T\sigma(\mathbf{X}_i)) \le C_\sigma^2\mathbf{V}_{11}$, so
$$\operatorname{var}(U_{i,k}) \sim \boldsymbol{\mu}_\omega(x_{1,k})^T\mathbf{S}_{21}\mathbf{V}_{11}\mathbf{S}_{21}\boldsymbol{\mu}_\omega(x_{1,k})V_{\varepsilon,D} = \boldsymbol{\mu}_\omega(x_{1,k})^T\mathbf{S}_{21}\boldsymbol{\mu}_\omega(x_{1,k})V_{\varepsilon,D},$$
where $V_{\varepsilon,D} = \operatorname{var}\{\varepsilon_{i,D}^*\mid\mathbf{X}_i\}$. Let $\kappa(x_{1,k}) = \{\boldsymbol{\mu}_\omega(x_{1,k})^T\boldsymbol{\mu}_\omega(x_{1,k})\}^{1/2}$. Then
$$c_Sc_\sigma^2\{\kappa(x_{1,k})\}^2V_{\varepsilon,D} \le \operatorname{var}(U_{i,k}) \le C_SC_\sigma^2\{\kappa(x_{1,k})\}^2V_{\varepsilon,D}.$$
Simple calculation leads to
$$E|U_{i,k}|^r \le \{c_0\kappa(x_{1,k})D_nH^{-1/2}\}^{r-2}r!\,E|U_{i,k}|^2 < +\infty$$
for $r \ge 3$, so $\{U_{i,k}\}_{i=1}^n$ satisfies Cramér's condition with Cramér's constant $c_* = c_0\kappa(x_{1,k})D_nH^{-1/2}$; hence by Bernstein's inequality
$$P\Biggl\{\Biggl|n^{-1}\sum_{i=1}^{n}U_{i,k}\Biggr| \ge \rho_n\Biggr\} \le a_1\exp\biggl(-\frac{q\rho_n^2}{25m_2^2 + 5c_*\rho_n}\biggr) + a_2(3)\,\alpha\biggl(\biggl[\frac{n}{q+1}\biggr]\biggr)^{6/7},$$
where $m_2^2 \sim \{\kappa(x_{1,k})\}^2V_{\varepsilon,D}$, $m_3 \le \{c\{\kappa(x_{1,k})\}^3H^{-1/2}D_nV_{\varepsilon,D}\}^{1/3}$,
$$\rho_n = \rho H, \qquad a_1 = 2\frac{n}{q} + 2\biggl(1 + \frac{\rho_n^2}{25m_2^2 + 5c_*\rho_n}\biggr), \qquad a_2(3) = 11n\biggl(1 + \frac{5m_3^{6/7}}{\rho_n}\biggr).$$
Similar arguments as in Lemma A.4 yield that, as $n \to \infty$,
$$\frac{q\rho_n^2}{25m_2^2 + 5c_*\rho_n} \sim \frac{q\rho_n}{c_*} = \frac{\rho n^{2/5}}{c_0(\log n)^{5/2}D_n} \to +\infty.$$
Taking $c_0$, $\rho$ large enough, one has
$$P\Biggl\{\frac{1}{n}\Biggl|\sum_{i=1}^{n}U_{i,k}\Biggr| > \rho H\Biggr\} \le c\log n\exp\{-c_2\rho^2\log n\} + Cn^{2-6\lambda_0c_0/7} \le n^{-3},$$
for $n$ large enough. Hence
$$\sum_{n=1}^{\infty}P(|W_1^D| \ge \rho H) \le \sum_{n=1}^{\infty}\sum_{k=1}^{M_n}P\Biggl(\Biggl|\frac{1}{n}\sum_{i=1}^{n}U_{i,k}\Biggr| \ge \rho H\Biggr) \le \sum_{n=1}^{\infty}M_nn^{-3} < \infty.$$
Thus, the Borel–Cantelli lemma entails that $W_1^D = O_p(H)$. Noting the fact that $|W_1 - W_1^D| = O_p(H)$, one has $W_1 = O_p(H)$. Similarly $W_2 = O_p(H)$. Thus
$$T_1 \le W_1 + W_2 = O_p(H). \tag{A.25}$$
Employing the Cauchy–Schwarz inequality and the Lipschitz continuity of the kernel $K$, assumption (A5), Lemma A.2(ii) and (A.21) lead to
$$T_2 \le O_p\biggl(\frac{N^{1/2}\log n}{n^{1/2}}\biggr)\frac{\{\sum_{J=1}^{N}EB_{J,2}^2(X_{12})\}^{1/2}}{h^2M_n} = o_p(n^{-1/2}). \tag{A.26}$$
Combining (A.24), (A.25) and (A.26), one has $\sup_{x_1\in[0,1]}Q_1(x_1) = O_p(H)$. The desired result follows from (A.22) and (A.23). $\square$

Acknowledgments. This work is part of the first author's dissertation under the supervision of the second author. The authors are very grateful to the Editor, Jianqing Fan, and three anonymous referees for their helpful comments.

REFERENCES

[1] BOSQ, D. (1998). Nonparametric Statistics for Stochastic Processes, 2nd ed. Lecture Notes in Statist. 110. Springer, New York. MR1640691
[2] CHEN, R. and TSAY, R. S. (1993). Nonlinear additive ARX models. J. Amer. Statist. Assoc. 88 955–967.
[3] DE BOOR, C. (2001). A Practical Guide to Splines, rev. ed. Springer, New York. MR1900298
[4] DOUKHAN, P. (1994). Mixing: Properties and Examples. Lecture Notes in Statist. 85. Springer, New York. MR1312160
[5] FAN, J. and GIJBELS, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London. MR1383587
[6] FAN, J., HÄRDLE, W. and MAMMEN, E. (1998). Direct estimation of low-dimensional components in additive models. Ann. Statist. 26 943–971. MR1635422
[7] FAN, J. and JIANG, J. (2005). Nonparametric inferences for additive models. J. Amer. Statist. Assoc. 100 890–907. MR2201017
[8] HÄRDLE, W., HLÁVKA, Z. and KLINKE, S. (2000). XploRe Application Guide. Springer, Berlin.
[9] HASTIE, T. J. and TIBSHIRANI, R. J. (1990). Generalized Additive Models. Chapman and Hall, London. MR1082147
[10] HENGARTNER, N. W. and SPERLICH, S. (2005). Rate optimal estimation with the integration method in the presence of many covariates. J. Multivariate Anal. 95 246–272. MR2170397
[11] HOROWITZ, J., KLEMELÄ, J. and MAMMEN, E. (2006). Optimal estimation in additive regression. Bernoulli 12 271–298. MR2218556
[12] HOROWITZ, J. and MAMMEN, E. (2004). Nonparametric estimation of an additive model with a link function. Ann. Statist. 32 2412–2443. MR2153990
[13] HUANG, J. Z. (1998). Projection estimation in multiple regression with application to functional ANOVA models. Ann. Statist. 26 242–272. MR1611780
[14] HUANG, J. Z. and YANG, L. (2004). Identification of nonlinear additive autoregressive models. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 463–477. MR2062388
[15] LINTON, O. B. (1997). Efficient estimation of additive nonparametric regression models. Biometrika 84 469–473. MR1467061
[16] LINTON, O. B. and HÄRDLE, W. (1996). Estimation of additive regression models with known links. Biometrika 83 529–540. MR1423873
[17] LINTON, O. B. and NIELSEN, J. P. (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika 82 93–100. MR1332841
[18] MAMMEN, E., LINTON, O. and NIELSEN, J. (1999). The existence and asymptotic properties of a backfitting projection algorithm under weak conditions. Ann. Statist. 27 1443–1490. MR1742496
[19] OPSOMER, J. D. and RUPPERT, D. (1997). Fitting a bivariate additive model by local polynomial regression. Ann. Statist. 25 186–211. MR1429922
[20] PHAM, D. T. (1986). The mixing property of bilinear and generalized random coefficient autoregressive models. Stochastic Process. Appl. 23 291–300. MR0876051
[21] ROBINSON, P. M. (1983). Nonparametric estimators for time series. J. Time Ser. Anal. 4 185–207. MR0732897


[22] SPERLICH, S., TJØSTHEIM, D. and YANG, L. (2002). Nonparametric estimation and testing of interaction in additive models. Econometric Theory 18 197–251. MR1891823
[23] STONE, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 689–705. MR0790566
[24] STONE, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation (with discussion). Ann. Statist. 22 118–184. MR1272079
[25] TJØSTHEIM, D. and AUESTAD, B. (1994). Nonparametric identification of nonlinear time series: Projections. J. Amer. Statist. Assoc. 89 1398–1409. MR1310230
[26] WANG, L. and YANG, L. (2006). Spline-backfitted kernel smoothing of nonlinear additive autoregression model. Manuscript. Available at www.arxiv.org/abs/math/0612677.
[27] XUE, L. and YANG, L. (2006). Estimation of semiparametric additive coefficient model. J. Statist. Plann. Inference 136 2506–2534. MR2279819
[28] XUE, L. and YANG, L. (2006). Additive coefficient modeling via polynomial spline. Statist. Sinica 16 1423–1446. MR2327498
[29] YANG, L., HÄRDLE, W. and NIELSEN, J. P. (1999). Nonparametric autoregression with multiplicative volatility and additive mean. J. Time Ser. Anal. 20 579–604. MR1720162
[30] YANG, L., SPERLICH, S. and HÄRDLE, W. (2003). Derivative estimation and testing in generalized additive models. J. Statist. Plann. Inference 115 521–542. MR1985882

DEPARTMENT OF STATISTICS
UNIVERSITY OF GEORGIA
ATHENS, GEORGIA 30602
USA
E-MAIL: [email protected]

DEPARTMENT OF STATISTICS AND PROBABILITY
MICHIGAN STATE UNIVERSITY
EAST LANSING, MICHIGAN 48824
USA
E-MAIL: [email protected]