
Adaptive Functional Linear Regression via Functional

Principal Component Analysis and Block Thresholding∗

T. Tony Cai1, Linjun Zhang1, and Harrison H. Zhou2

University of Pennsylvania and Yale University

Abstract

Theoretical results in the functional linear regression literature have so far focused on minimax estimation, where smoothness parameters are assumed to be known and the estimators typically depend on these smoothness parameters. In this paper we consider adaptive estimation in functional linear regression. The goal is to construct a single data-driven procedure that achieves optimality results simultaneously over a collection of parameter spaces. Such an adaptive procedure automatically adjusts to the smoothness properties of the underlying slope and covariance functions. The main technical tools for the construction of the adaptive procedure are functional principal component analysis and block thresholding. The estimator of the slope function is shown to adaptively attain the optimal rate of convergence over a large collection of function spaces.

Keywords: Adaptive estimation; Block thresholding; Eigenfunction; Eigenvalue; Functional data analysis; Functional principal component analysis; Minimax estimation; Rate of convergence; Slope function; Smoothing; Spectral decomposition.

AMS 2000 Subject Classification: Primary 62J05; secondary 62G20.

∗ In Memory of Peter G. Hall.

1 Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104. The research of Tony Cai was supported in part by NSF Grant DMS-1403708 and NIH Grant R01 CA127334.

2 Department of Statistics, Yale University, New Haven, CT 06511. The research of Harrison Zhou was supported in part by NSF Grant DMS-1507511.


1 Introduction

Due to advances in technology, functional data now arise commonly in many different fields of applied science including, for example, chemometrics, biomedical studies, and econometrics. There has been extensive recent research on functional data analysis, and much progress has been made on developing methodologies for analyzing functional data. The two monographs by Ramsay and Silverman (2002, 2005) provide comprehensive discussions of the methods and applications. See also Ferraty and Vieu (2006).

Among the many problems involving functional data, functional linear regression has received substantial attention. Consider a functional linear model where one observes a random sample $\{(X_i, Y_i) : i = 1, \dots, n\}$ with
$$Y_i = a + \int_0^1 X_i(t)\, b(t)\,dt + Z_i, \qquad (1)$$
where the response $Y_i$ and the intercept $a$ are scalars, the predictor $X_i$ and the slope function $b$ are functions in $L^2([0,1])$, and the errors $Z_i$ are independent and identically distributed $N(0, \sigma^2)$ variables. The goal is to estimate the slope function $b(t)$ and the intercept $a$ based on the sample $\{(X_i, Y_i) : i = 1, \dots, n\}$. Note that once an estimator $\hat b$ of $b$ is constructed, the intercept $a$ can be estimated easily by
$$\hat a = \bar Y - \int_0^1 \bar X(t)\,\hat b(t)\,dt,$$
where $\bar Y$ and $\bar X$ are the averages of the $Y_i$ and the $X_i$ respectively. We shall thus focus our discussion in this paper on estimating the slope function $b$. The slope function is of significant interest in its own right. For example, knowing where $b$ takes large or small values provides information about where a future observation $x$ of $X$ will have the greatest leverage on the conditional mean of $Y$ given $X = x$.

The problem of slope function estimation is intrinsically nonparametric, and the convergence rate under the mean integrated squared error (MISE)
$$R(\hat b, b) = E\|\hat b - b\|_2^2 = E\int_0^1 \big(\hat b(t) - b(t)\big)^2\,dt \qquad (2)$$
is typically slower than $n^{-1}$. Rates of convergence of an estimator $\hat b$ to $b$ have been studied in, e.g., Ferraty and Vieu (2000), Cuevas et al. (2002), Cardot and Sarda (2006), Li and Hsing (2007), and Hall and Horowitz (2007). In particular, Hall and Horowitz (2007) showed that the minimax rate of convergence for estimating $b$ under the MISE (2) is determined by the smoothness of the slope function and of the covariance function of the distribution of the explanatory variables. Cai and Hall (2006) considered a related prediction problem and Müller and Stadtmüller (2005) studied generalized functional linear models.

The theory of slope function estimation has so far focused on minimax estimation, where the smoothness parameters are assumed to be known; the estimators typically depend on these smoothness parameters. Although the minimax risk provides a useful uniform benchmark for the comparison of estimators, minimax estimators often require full knowledge of the parameter space, which is unknown in practice. A minimax estimator designed for a specific parameter space typically performs poorly over another parameter space. This makes adaptation essential for functional linear regression.

In the present paper we consider adaptive estimation of the slope function $b$. The goal is to construct a single data-driven procedure that achieves optimality results simultaneously over a collection of parameter spaces. Such an adaptive procedure does not require knowledge of the parameter space and automatically adjusts to the smoothness properties of the underlying slope and covariance functions. In Section 2, we construct a procedure for estimating the slope function $b$ using functional principal component analysis (PCA) and block thresholding. The estimator is shown to adaptively achieve the optimal rate of convergence simultaneously over a collection of function classes.

The main technical tools are functional PCA and block thresholding. Functional PCA is a convenient and commonly used technique in functional data analysis; see, e.g., Ramsay and Silverman (2002, 2005). Block thresholding was first developed in nonparametric function estimation. It increases estimation precision and achieves adaptivity by utilizing information about neighboring coordinates. The idea of block thresholding can be traced back to Efromovich (1985) in the estimation of a density function using the trigonometric basis. It was further developed in wavelet function estimation; see Hall, Kerkyacharian and Picard (1998) for density estimation and Cai (1999) for nonparametric regression. Cai, Low and Zhao (2009) used weakly geometrically growing block sizes for sharp adaptation over ellipsoids in the context of the white noise model. In this paper we shall follow the ideas in Cai, Low and Zhao (2009) and use weakly geometrically growing block sizes for adaptive functional linear regression. Our results show that block thresholding naturally connects shrinkage rules developed in classical normal decision theory with functional linear regression.

The proposed block thresholding procedure is easily implementable. A simulation study is carried out to investigate its numerical performance. In particular, we compare its finite-sample properties with those of the non-adaptive procedure introduced in Hall and Horowitz (2007). The results demonstrate the advantage of the proposed procedure.


The paper is organized as follows. In Section 2, after basic notation and facts on the spectral decomposition of the covariance function are reviewed, the block thresholding procedure for estimating the slope function $b$ is defined in Section 2.2. Section 3 investigates the theoretical properties of the block thresholding procedure. It is shown that the estimator enjoys a high degree of adaptivity. Section 4 discusses the numerical performance of the proposed estimator and shows the advantage of the adaptive procedure. The proofs are given in Section 5.

2 Methodology

Estimating the slope function $b$ in functional linear regression involves solving an ill-posed inverse problem. The main difference from conventional linear inverse problems is that in functional linear regression the operator is not given. A major technical step in the construction of the slope function estimator is to estimate the eigenvalues and eigenfunctions of the unknown linear operator and to bound the errors between the estimates and the estimands. Necessary technical tools for slope function estimation include functional analysis and statistical smoothing. Specifically, our estimator is based on functional principal component analysis and block thresholding techniques. In this section we begin with the spectral decomposition of the covariance function in terms of eigenvalues and eigenfunctions. We then introduce in Section 2.2 a blockwise James-Stein procedure to estimate the slope function $b$.

2.1 Spectral decomposition

Suppose we observe a random sample $\{(X_i, Y_i) : i = 1, \dots, n\}$ as in (1). Let $(X, Y, Z)$ denote a generic $(X_i, Y_i, Z_i)$. Define the covariance function and the empirical covariance function respectively as
$$K(u, v) = \mathrm{cov}\{X(u), X(v)\}, \qquad \hat K(u, v) = \frac{1}{n}\sum_{i=1}^n \{X_i(u) - \bar X(u)\}\{X_i(v) - \bar X(v)\},$$
where $\bar X = \frac{1}{n}\sum X_i$. The covariance function $K$ defines a linear operator which maps a function $f$ to $Kf$ given by $(Kf)(u) = \int K(u, v) f(v)\,dv$. We shall assume that the linear operator with kernel $K$ is positive definite.
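As a concrete illustration of how the spectral quantities below can be computed from data, the following sketch (ours, not from the paper; all function and variable names are hypothetical) discretizes the empirical covariance operator on a grid and extracts its eigenpairs, assuming the curves are observed on a common equispaced grid:

```python
import numpy as np

def empirical_fpca(X, grid):
    """Eigenpairs of the empirical covariance operator K-hat.
    X: (n, T) array of curves sampled on a common equispaced grid.
    Integrals are approximated by Riemann sums with spacing h."""
    n, T = X.shape
    h = grid[1] - grid[0]
    Xc = X - X.mean(axis=0)                    # center the curves: X_i - X-bar
    K_hat = Xc.T @ Xc / n                      # K-hat(u, v) evaluated on the grid
    # Discretized operator: (K f)(u) = int K(u, v) f(v) dv  ~  (K_hat * h) f
    evals, evecs = np.linalg.eigh(K_hat * h)
    order = np.argsort(evals)[::-1]            # eigenvalues in decreasing order
    theta_hat = np.maximum(evals[order], 0.0)  # clip tiny negative round-off
    phi_hat = evecs[:, order].T / np.sqrt(h)   # rows phi-hat_j with int phi^2 = 1
    return theta_hat, phi_hat
```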


Write the spectral decompositions of the covariance functions $K$ and $\hat K$ as
$$K(u, v) = \sum_{j=1}^\infty \theta_j \varphi_j(u)\varphi_j(v), \qquad \hat K(u, v) = \sum_{j=1}^\infty \hat\theta_j \hat\varphi_j(u)\hat\varphi_j(v), \qquad (3)$$
where
$$\theta_1 > \theta_2 > \cdots > 0 \quad\text{and}\quad \hat\theta_1 \ge \hat\theta_2 \ge \cdots \ge \hat\theta_{n+1} = \cdots = 0 \qquad (4)$$
are respectively the ordered eigenvalue sequences of the linear operators with kernels $K$ and $\hat K$, and $\{\varphi_j\}$ and $\{\hat\varphi_j\}$ are the corresponding orthonormal eigenfunction sequences. The sequences $\{\varphi_j\}$ and $\{\hat\varphi_j\}$ each form an orthonormal basis in $L^2([0, 1])$.

The functional linear model (1) can be rewritten as
$$Y_i = \mu + \int [X_i - E(X)]\, b + Z_i, \qquad i = 1, 2, \dots, n, \qquad (5)$$
where $\mu = E(Y_i) = a + E\int X b$. The Karhunen-Loève expansion of the random function $X_i - EX$ is given by
$$X_i - EX = \sum_{j=1}^\infty x_{i,j}\varphi_j, \qquad (6)$$
where the random variable $x_{i,j} = \int (X_i - EX)\varphi_j$ has mean zero and variance $\mathrm{Var}(x_{i,j}) = \theta_j$. In addition, the random variables $x_{i,j}$ are uncorrelated. Expand the slope function $b$ in the orthonormal basis $\{\varphi_j\}$ as $b = \sum_{j=1}^\infty b_j\varphi_j$. Then the model (5) can be written as
$$Y_i = \mu + \sum_{j=1}^\infty x_{i,j} b_j + Z_i, \qquad i = 1, 2, \dots, n, \qquad (7)$$
and the problem of estimating the slope function $b$ is transformed into that of estimating the coefficients $\{b_j\}$ as well as the eigenfunctions $\{\varphi_j\}$. Note that in (7) $\mu$ and the $x_{i,j}$ are unknown and thus need to be estimated from the data.

The mean $\mu$ of $Y$ can be estimated easily by the sample mean $\hat\mu = \bar Y$. To estimate the $x_{i,j}$, we expand $X_i - \bar X$ in the orthonormal basis $\{\hat\varphi_j\}$ as
$$X_i - \bar X = \sum_{j=1}^n \hat x_{i,j}\hat\varphi_j \qquad \text{for } i = 1, 2, \dots, n, \qquad (8)$$
where $\hat x_{i,j} = \int (X_i - \bar X)\hat\varphi_j$. Note that
$$\sum_{i=1}^n \hat x_{i,j} = \sum_{i=1}^n \int (X_i - \bar X)\hat\varphi_j = \int \Big[\sum_{i=1}^n (X_i - \bar X)\Big]\hat\varphi_j = 0$$
and
$$\frac{1}{n}\sum_{i=1}^n \hat x_{i,j}\hat x_{i,k} = \int\!\!\int \hat K(u, v)\hat\varphi_j(u)\hat\varphi_k(v) = \hat\theta_j \delta_{j,k} \qquad (9)$$
for all $j$ and $k$, where $\delta_{j,k}$ is the Kronecker delta with $\delta_{j,k} = 1$ if $j = k$ and $0$ otherwise.

Since $Y = a + \int_0^1 X(t) b(t)\,dt + Z$, we have
$$Y_i - \bar Y = \int \big[X_i - \bar X\big]\, b + Z_i - \bar Z, \qquad i = 1, 2, \dots, n.$$
Hence
$$Y_i - \bar Y = \sum_{j=1}^n \hat x_{i,j}\,\bar b_j + Z_i - \bar Z, \qquad i = 1, 2, \dots, n, \qquad (10)$$
where $\bar b_j = \int b\,\hat\varphi_j$, and consequently $b = \sum_{j=1}^\infty \bar b_j \hat\varphi_j$. Since the slope function $b$ is unknown, the coefficients $\bar b_j$ are also unknown and need to be estimated. A typical principal components regression approach is to replace "$n$" in equation (10) by a constant $m < n$ and estimate the $\bar b_j$ by ordinary least squares.

Since the "predictors" $(\hat x_{i,j})_{1\le j\le n}$ in equation (10) are orthogonal to each other and $\sum_{i=1}^n \hat x_{i,j}^2 = \hat\theta_j n$ from equation (9), for $\hat\theta_j \ne 0$ we may estimate $b_j$ (or $\bar b_j$) by
$$\hat b_j = \hat\theta_j^{-1} n^{-1}\sum_{i=1}^n (Y_i - \bar Y)\hat x_{i,j} = \hat\theta_j^{-1} n^{-1}\sum_{i=1}^n (Y_i - \bar Y)\int \big[X_i(u) - \bar X(u)\big]\hat\varphi_j(u) = \hat\theta_j^{-1}\int \hat g(u)\hat\varphi_j(u) = \hat\theta_j^{-1}\hat g_j, \qquad (11)$$
where
$$\hat g(u) = n^{-1}\sum_{i=1}^n (Y_i - \bar Y)\big[X_i(u) - \bar X(u)\big] \quad\text{and}\quad \hat g_j = \int \hat g\,\hat\varphi_j. \qquad (12)$$
It is expected that $\hat g$ is approximately
$$g(u) = E\big[(Y - \mu)(X(u) - EX(u))\big] = \int K(u, v)\, b(v)\,dv.$$
Write $g = \sum_{j=1}^\infty g_j\varphi_j$. It is easy to check that $b_j = \theta_j^{-1} g_j$, so the estimator $\hat b_j = \hat\theta_j^{-1}\hat g_j$ in equation (11) can be regarded as an empirical version of the true coefficient $b_j$. We shall construct an adaptive estimator of the $b_j$ based on the empirical coefficients $\hat b_j$ by using a block thresholding technique.
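Continuing the sketch from Section 2.1 (same hypothetical names), the empirical coefficients $\hat g_j$ of (12) and $\hat b_j = \hat\theta_j^{-1}\hat g_j$ of (11) can be computed directly from the sample:

```python
def empirical_coefficients(X, Y, grid, theta_hat, phi_hat, n_terms):
    """Empirical coefficients g-hat_j and b-hat_j = theta-hat_j^{-1} g-hat_j
    of equations (11)-(12); integrals are approximated by Riemann sums."""
    h = grid[1] - grid[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    # g-hat(u) = n^{-1} sum_i (Y_i - Y-bar)(X_i(u) - X-bar(u))
    g_hat = Yc @ Xc / len(Y)
    # g-hat_j = int g-hat(u) phi-hat_j(u) du
    g_j = (phi_hat[:n_terms] * g_hat).sum(axis=1) * h
    b_j = g_j / theta_hat[:n_terms]
    return g_j, b_j
```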

2.2 A Block Thresholding Procedure

Block thresholding techniques have been well developed in the nonparametric function estimation literature; see, e.g., Efromovich (1985), Hall, Kerkyacharian and Picard (1998) and Cai (1999). In this paper we shall use a block thresholding method with weakly geometrically growing block sizes for adaptive functional linear regression. This method was used in Cai, Low and Zhao (2009) for sharp adaptive estimation over ellipsoids in the classical white noise model.

Block thresholding procedures work especially well with homoscedastic Gaussian data. However, in the current setting the empirical coefficients $\hat b_j$ are heteroscedastic with growing variances. We will see in Lemma 3 in Section 5 that the variance of $\hat b_j$ is approximately $\sigma^2\theta_j^{-1} n^{-1}$, which grows as $j$ increases. We shall thus rescale the $\hat b_j$ to stabilize the variances.

With the notation introduced above, the block thresholding procedure can be described in detail as follows. Let
$$\hat m_* = \arg\min\big\{m : \hat\theta_m/\hat\theta_1 \le n^{-1/3}\big\}. \qquad (13)$$
It will be shown in Section 5 that, under certain regularity conditions, there is no need ever to go beyond the $\hat m_*$-th term. We define
$$\tilde g_j = \begin{cases} \hat g_j & j < \hat m_* \\ 0 & \text{otherwise} \end{cases}$$
and set
$$d_j = \theta_j^{-\frac12} g_j \quad\text{and}\quad \hat d_j = \hat\theta_j^{-\frac12}\tilde g_j. \qquad (14)$$
Lemma 4 in Section 5 shows that $\mathrm{Var}(\hat d_j) = \frac{\sigma^2}{n}(1 + o(1))$, so the $\hat d_j$ are nearly homoscedastic. We shall apply a blockwise James-Stein procedure to the $\hat d_j$ to construct an estimator $\tilde d_j$ of $d_j$ and then estimate $b_j$ by $\tilde b_j = \hat\theta_j^{-\frac12}\tilde d_j$.

The block thresholding procedure for estimating the slope function $b$ has three steps.

1. Divide the indices $\{1, 2, \dots, \hat m_*\}$ into nonoverlapping blocks $B_1, B_2, \dots, B_N$ with $\mathrm{Card}(B_i) = \big\lfloor (1 + 1/\log n)^{i+1} \big\rfloor$.

2. Apply a blockwise James-Stein rule to each block. For all $j \in B_i$, set
$$\tilde d_j = \Big(1 - \frac{2 L_i \sigma^2}{n S_i^2}\Big)_+ \hat d_j, \qquad (15)$$
where $S_i^2 = \sum_{j \in B_i} \hat d_j^2$ and $L_i = \mathrm{Card}(B_i)$.

3. Set $\tilde b_j = \hat\theta_j^{-\frac12}\tilde d_j$. The estimator of $b$ is then given by
$$\hat b(u) = \sum_{j=1}^{\hat m_*} \tilde b_j \hat\varphi_j(u) = \sum_{j=1}^{\hat m_*} \rho_j \hat b_j \hat\varphi_j(u), \qquad (16)$$
where $\rho_j = \Big(1 - \frac{2 L_i \sigma^2}{n S_i^2}\Big)_+$ for all $j \in B_i$ is the shrinkage factor.
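The three steps translate directly into code. The sketch below (again with hypothetical names, reusing the helpers above, and with $\lambda = 2$ as in (15)) combines the cutoff (13), the block partition, and the blockwise James-Stein rule; it assumes $\sigma$ is known, as the procedure does:

```python
def block_threshold_estimate(theta_hat, phi_hat, b_j, sigma, n):
    """Blockwise James-Stein estimate of the slope function b on the grid."""
    # Cutoff (13): smallest m with theta-hat_m / theta-hat_1 <= n^{-1/3}
    below = theta_hat / theta_hat[0] <= n ** (-1.0 / 3)
    m_star = int(np.argmax(below)) + 1 if below.any() else len(theta_hat)
    d_hat = np.sqrt(theta_hat[:m_star]) * b_j[:m_star]  # d-hat_j = theta-hat_j^{1/2} b-hat_j
    # Step 1: blocks of weakly geometrically growing size (1 + 1/log n)^{i+1}
    d_tilde = np.zeros(m_star)
    start, i = 0, 1
    while start < m_star:
        L = int((1 + 1 / np.log(n)) ** (i + 1))
        block = slice(start, min(start + L, m_star))
        # Step 2: James-Stein shrinkage (15) applied blockwise
        S2 = np.sum(d_hat[block] ** 2)
        rho = max(0.0, 1 - 2 * L * sigma ** 2 / (n * S2)) if S2 > 0 else 0.0
        d_tilde[block] = rho * d_hat[block]
        start, i = start + L, i + 1
    # Step 3: back to the b-scale and synthesize b-hat(u) on the grid
    b_tilde = d_tilde / np.sqrt(theta_hat[:m_star])
    return b_tilde @ phi_hat[:m_star]
```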

The block thresholding procedure given above is purely data-driven and easy to implement. In particular, it does not require knowledge of the rate of decay of the eigenvalues $\theta_j$ or of the coefficients $b_j$ of the slope function $b$. In contrast, the minimax rate optimal estimator given in Hall and Horowitz (2007) depends critically on the rates of decay of the $\theta_j$ and the $b_j$.

Remark 1 We have used the blockwise James-Stein procedure in (15) because of its simplicity. In addition to the James-Stein rule, other shrinkage rules such as the blockwise hard thresholding rule
$$\tilde d_j = \hat d_j \cdot I\big(S_i^2 \ge \lambda L_i \sigma^2/n\big)$$
can be used as well.

Remark 2 In the procedure we have assumed that $\sigma$ is known; in practice it can be estimated easily. In equation (10), we may apply principal components regression by replacing "$n$" in equation (10) with a constant $m = \log^2 n$. Let $\hat b_j$ be the ordinary least squares estimate of $\bar b_j$. It can be shown easily that $\sum_{j=1}^m \hat b_j \hat\varphi_j$ is a consistent estimate of $b$. Then we obtain a consistent estimate of $\sigma^2$ with
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \Big(Y_i - \bar Y - \sum_{j=1}^m \hat x_{i,j}\hat b_j\Big)^2.$$
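A minimal sketch of this pilot estimate (ours; it reads the cutoff as $m = \log^2 n$ and reuses the hypothetical helpers from Section 2.1):

```python
def sigma2_pilot(X, Y, grid, theta_hat, phi_hat):
    """Consistent pilot estimate of sigma^2 via principal components
    regression on the first m = log^2 n empirical scores (Remark 2)."""
    n = len(Y)
    h = grid[1] - grid[0]
    m = max(1, int(np.log(n) ** 2))
    Xc = X - X.mean(axis=0)
    x_scores = (Xc @ phi_hat[:m].T) * h     # x-hat_{i,j} = int (X_i - X-bar) phi-hat_j
    # On an orthogonal score design, OLS coincides with b-hat_j of (11)
    _, b_ols = empirical_coefficients(X, Y, grid, theta_hat, phi_hat, m)
    resid = (Y - Y.mean()) - x_scores @ b_ols
    return float(np.mean(resid ** 2))
```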

3 Theoretical Properties

We now turn to the asymptotic properties of the block thresholding procedure for functional linear regression under the mean integrated squared error (2). The theoretical results show that the block thresholding estimator given in (16) adaptively attains the exact minimax rate of convergence simultaneously over a large collection of function spaces.

In this section we begin by considering adaptivity of the block thresholding estimator over the following function spaces, which have been considered by Cai and Hall (2006) and Hall and Horowitz (2007) in the contexts of prediction and slope function estimation. These function classes arise naturally in functional linear regression based on functional principal component analysis. For more details, see Cai and Hall (2006) and Hall and Horowitz (2007); see also Hall and Hosseini-Nasab (2006).

Let $\beta > 0$ and $M_* > 0$ be constants. Define the function class for $b$ by
$$\mathcal{B}_\beta(M_*) = \Big\{ b = \sum_{j=1}^\infty b_j \varphi_j, \ \text{with } |b_j| \le M_*\, j^{-\beta} \text{ for } j = 1, 2, \dots \Big\}. \qquad (17)$$
We can interpret this as a "smoothness class" of functions, where the functions become "smoother" (measured in the sense of generalized Fourier expansions in the basis $\{\varphi_j\}$) as $\beta$ increases. We shall also assume that the eigenvalues satisfy
$$M_0^{-1} j^{-\alpha} \le \theta_j \le M_0\, j^{-\alpha}, \qquad \theta_j - \theta_{j+1} \ge M_0^{-1} j^{-\alpha-1} \quad\text{for } j = 1, 2, \dots. \qquad (18)$$
This condition is assumed so that we may possibly obtain a reasonable estimate of the eigenfunction corresponding to $\theta_j$. Our adaptivity result also requires the following conditions on $X$. The process $X$ is assumed to be left-continuous (or right-continuous) at each point, and for each $k > 0$ and some $\varepsilon > 0$,
$$\sup_t E\big\{|X(t)|^k\big\} < M_k \quad\text{and}\quad \sup_{s,t} E\big\{|s - t|^{-\varepsilon}\,|X(t) - X(s)|^k\big\} < M_{k,\varepsilon}, \qquad (19)$$
and for each $r \ge 1$,
$$\sup_{j\ge1} \theta_j^{-r}\, E\Big[\int (X - EX)\varphi_j\Big]^{2r} \le M'_r \qquad (20)$$
for some constant $M'_r > 0$.

Let $\mathcal F(\alpha, \beta, M)$ denote the set of distributions $F$ of $(X, Y)$ that satisfy (17)-(20) with $M = \{M_*, M_0, M_k, M_{k,\varepsilon}, M'_r\}$. The minimax rate of convergence for estimating the slope function $b$ over these smoothness classes has been derived by Hall and Horowitz (2007). It is shown that the minimax risk satisfies
$$\inf_{\hat b}\ \sup_{\mathcal F(\alpha,\beta,M)} E\|\hat b - b\|_2^2 \asymp n^{-\frac{2\beta-1}{\alpha+2\beta}}. \qquad (21)$$
The rate-optimal procedure given in Hall and Horowitz (2007) is based on frequency cut-off. Their estimator is not adaptive; it requires knowledge of $\alpha$ and $\beta$. The following result shows that the block thresholding estimator $\hat b$ given in (16) is rate optimally adaptive over the collection of parameter spaces.

Theorem 1 Under the conditions (17)-(20), the block thresholding estimator $\hat b$ given in (16) satisfies, for all $2 < \alpha < \beta$,
$$\sup_{\mathcal F(\alpha,\beta,M)} E\|\hat b - b\|_2^2 \le D\, n^{-\frac{2\beta-1}{\alpha+2\beta}} \qquad (22)$$
for some constant $D > 0$.


In addition to the function classes defined in (17), one can also consider adaptivity of the estimator $\hat b$ over other function classes. For example, consider the following function classes with a Sobolev-type constraint:
$$\mathcal S_\beta(M_*) = \Big\{ b = \sum_{j=1}^\infty b_j \varphi_j, \ \text{with } \sum_{j=1}^\infty j^{2\beta-1} b_j^2 \le M_* \Big\}.$$
Let $\mathcal F_1(\alpha, \beta, M)$ denote the set of distributions of $(X, Y)$ that satisfy (18)-(20) and $b \in \mathcal S_\beta(M_*)$.

Theorem 2 Under assumptions (18)-(20), the estimator $\hat b$ given in (16) satisfies, for all $2 < \alpha < \beta$,
$$\sup_{\mathcal F_1(\alpha,\beta,M)} E\|\hat b - b\|_2^2 \le D\, n^{-\frac{2\beta-1}{\alpha+2\beta}} \qquad (23)$$
for some constant $D > 0$.

The proof of Theorem 2 is similar to that of Theorem 1 with some minor modifications.

Remark 3 Theorems 1 and 2 remain true if the shrinkage factor $\rho_j$ in (16) is replaced by $\rho_j = \Big(1 - \frac{\lambda L_i\sigma^2}{nS_i^2}\Big)_+$ for any constant $\lambda > 1$.

Remark 4 We have so far focused on block thresholding. A simpler term-by-term thresholding rule can be used to yield a slightly weaker result. Let $\hat b_j = \hat\theta_j^{-1}\hat g_j$ as in (11). Set
$$\tilde b_j = \begin{cases} \mathrm{sgn}(\hat b_j)\Big(|\hat b_j| - \sigma\sqrt{\tfrac{2\log n}{n\hat\theta_j}}\Big)_+ & \text{for } 1 \le j \le \hat m_* \\ 0 & \text{for } j > \hat m_*. \end{cases} \qquad (24)$$
Note that this estimator is equivalent to setting
$$\tilde d_j = \begin{cases} \mathrm{sgn}(\hat d_j)\Big(|\hat d_j| - \sigma\sqrt{\tfrac{2\log n}{n}}\Big)_+ & \text{for } 1 \le j \le \hat m_* \\ 0 & \text{for } j > \hat m_* \end{cases} \qquad (25)$$
and $\tilde b_j = \hat\theta_j^{-\frac12}\tilde d_j$. Now let $\hat b_t(u) = \sum_{j=1}^{\hat m_*}\tilde b_j\hat\varphi_j(u)$ with $\tilde b_j$ given in (24). Then under the conditions of Theorem 1, we have
$$\sup_{\mathcal F(\alpha,\beta,M)} E\|\hat b_t - b\|_2^2 \le C\Big(\frac{\log n}{n}\Big)^{\frac{2\beta-1}{\alpha+2\beta}} \qquad (26)$$
for some constant $C > 0$. In other words, the term-by-term thresholding estimator $\hat b_t$ is simultaneously within a logarithmic factor of the minimax risk over a collection of function classes. The same result holds with $\mathcal F(\alpha,\beta,M)$ replaced by $\mathcal F_1(\alpha,\beta,M)$ in (26).
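For completeness, the term-by-term rule (24) is a one-line variant of the block procedure sketched in Section 2.2 (hypothetical names as before):

```python
def term_threshold_estimate(theta_hat, phi_hat, b_j, sigma, n, m_star):
    """Term-by-term soft thresholding of Remark 4, equation (24)."""
    lam = sigma * np.sqrt(2 * np.log(n) / (n * theta_hat[:m_star]))  # per-coordinate threshold
    b_t = np.sign(b_j[:m_star]) * np.maximum(np.abs(b_j[:m_star]) - lam, 0.0)
    return b_t @ phi_hat[:m_star]          # b-hat_t(u) on the grid
```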


4 Numerical Properties

The block thresholding procedure proposed in Section 2.2 is easy to implement. In this section, we investigate its numerical performance through a simulation study in two settings. In particular, we compare its finite-sample properties with those of the non-adaptive procedure introduced in Hall and Horowitz (2007).

In the first setting, the predictors $X_i$ are observed continuously and are independently distributed as
$$X = \sum_j \gamma_j W_j \varphi_j(t), \qquad t \in [0, 2],$$
where $\gamma_j = (-1)^{j+1} j^{-\alpha/2}$, $\{\varphi_j\}_{j=1}^\infty = \{1, \sin(\pi t), \cos(\pi t), \sin(2\pi t), \cos(2\pi t), \dots\}$ is the Fourier series with period 2, and the $W_j$ are i.i.d. standard normal variables. In addition, the slope function is taken to be $b(t) = \sum_j j^{-\beta}\varphi_j(t)$ and the errors $Z_i$ are distributed as $N(0, \sigma^2)$. To show the advantage of the adaptive procedure, we compare our estimator $\hat b$ with that of Hall and Horowitz (2007), denoted by $\hat b_{H,m}$, which was shown to be minimax optimal but not adaptive. In the construction of their estimator $\hat b_{H,m}$, one needs to specify the optimal cut-off index $m$, which requires knowledge of the smoothness parameters that are usually unknown in practice. We compare the MISE (2) of our adaptive procedure and of their method with different values of $m$ chosen from $\{1, 2, \dots, 8\}$. Similar to the setting in Hall and Horowitz (2007), we take a range of values for $(\sigma, n, \alpha, \beta)$ in our simulation study. Specifically, $(\sigma, n, \alpha, \beta)$ was chosen from the set $\{1, 2\} \times \{100, 500\} \times \{1.1, 1.5, 2.0\} \times \{2, 2.5\}$. All the results are based on 200 Monte Carlo replications for each parameter setting, and the MISEs of the different procedures are recorded in Table 1.
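For reproducibility, one draw from this first setting can be generated as follows (a sketch under stated assumptions: the Karhunen-Loève series is truncated at a hypothetical level J, and the basis is taken exactly as written in the text, without renormalizing the constant function):

```python
def simulate_setting_one(n, sigma, alpha, beta, grid, J=50, rng=None):
    """One sample from the continuous-X simulation setting on [0, 2]."""
    rng = rng if rng is not None else np.random.default_rng()
    J_idx = np.arange(1, J + 1)
    phi = np.empty((J, len(grid)))
    phi[0] = 1.0                                   # {1, sin(pi t), cos(pi t), ...}
    for j in range(1, J, 2):
        k = (j + 1) // 2
        phi[j] = np.sin(k * np.pi * grid)
        if j + 1 < J:
            phi[j + 1] = np.cos(k * np.pi * grid)
    gamma = (-1.0) ** (J_idx + 1) * J_idx ** (-alpha / 2)  # gamma_j = (-1)^{j+1} j^{-alpha/2}
    X = (gamma * rng.standard_normal((n, J))) @ phi        # X_i = sum_j gamma_j W_ij phi_j
    b = (J_idx ** (-beta.__float__() if hasattr(beta, "__float__") else beta)) @ phi  # b(t) = sum_j j^{-beta} phi_j(t)
    h = grid[1] - grid[0]
    Y = X @ b * h + sigma * rng.standard_normal(n)         # Y_i = int X_i b + Z_i
    return X, Y, b
```

For example, X, Y, b = simulate_setting_one(100, 1, 1.1, 2.0, np.linspace(0, 2, 401)) generates data under the first parameter configuration of the design.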

The choices of $X$, $b$ and the parameters $(\sigma, n, \alpha, \beta)$ in the second setting are the same as those in the first, except that each $X_i$ is observed discretely on an equally spaced grid of 41 points on $[0, 2]$, with i.i.d. additive $N(0, 4)$ random noise. We use a Fourier basis smoother to estimate the functions $X_i$ from the discrete data. Table 2 summarizes the averaged MISEs of the proposed block thresholding method and of the method of Hall and Horowitz (2007) with different cut-off indices $m$, computed by averaging over 200 Monte Carlo simulations. It is clear from these results that both procedures are robust against discretization and random errors.

The results in both Table 1 and Table 2 show that the MISEs of the method proposed by Hall and Horowitz (2007) are sensitive to the choice of the cut-off index $m$. The optimal choice of $m$ is usually unknown in practice and needs to be tuned via empirical methods such as cross-validation. In contrast, our proposed method is purely data-driven and adaptive to different degrees of smoothness. Both tables demonstrate the advantage of the proposed adaptive procedure. The performance of the proposed adaptive method is as good as, if not better than, that of the method of Hall and Horowitz (2007) with the optimally chosen $m$.

Table 1: Comparison of MISE in the continuous $X$ case (columns under $\hat b_{H,m}$: $m = 1, \dots, 8$).

  n   σ    α    β   MISE($\hat b$)   m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8
 100  1   1.1   2      0.032        0.113  0.053  0.038  0.033  0.031  0.032  0.034  0.038
 100  1   1.1  2.5     0.027        0.080  0.039  0.029  0.028  0.028  0.030  0.034  0.038
 100  1   1.5   2      0.024        0.097  0.034  0.023  0.021  0.024  0.032  0.046  0.059
 100  1   1.5  2.5     0.016        0.055  0.019  0.015  0.017  0.025  0.033  0.044  0.058
 100  1   2.0   2      0.027        0.092  0.029  0.019  0.023  0.039  0.063  0.097  0.133
 100  1   2.0  2.5     0.014        0.044  0.014  0.013  0.024  0.038  0.060  0.093  0.127
 100  2   1.1   2      0.046        0.149  0.058  0.047  0.049  0.059  0.077  0.097  0.118
 100  2   1.1  2.5     0.038        0.080  0.041  0.041  0.047  0.060  0.074  0.094  0.113
 100  2   1.5   2      0.042        0.102  0.045  0.037  0.056  0.076  0.114  0.170  0.220
 100  2   1.5  2.5     0.029        0.059  0.028  0.033  0.050  0.073  0.108  0.160  0.212
 100  2   2.0   2      0.042        0.087  0.036  0.045  0.069  0.129  0.227  0.354  0.500
 100  2   2.0  2.5     0.028        0.046  0.021  0.038  0.088  0.145  0.226  0.340  0.552
 500  1   1.1   2      0.007        0.091  0.027  0.014  0.009  0.008  0.007  0.007  0.008
 500  1   1.1  2.5     0.005        0.045  0.012  0.007  0.006  0.005  0.006  0.006  0.007
 500  1   1.5   2      0.007        0.084  0.022  0.010  0.007  0.007  0.008  0.009  0.012
 500  1   1.5  2.5     0.004        0.038  0.008  0.005  0.004  0.004  0.006  0.007  0.104
 500  1   2.0   2      0.010        0.087  0.022  0.010  0.008  0.010  0.012  0.016  0.024
 500  1   2.0  2.5     0.005        0.038  0.007  0.004  0.005  0.008  0.013  0.018  0.024
 500  2   1.1   2      0.014        0.093  0.028  0.016  0.014  0.015  0.017  0.021  0.026
 500  2   1.1  2.5     0.009        0.045  0.013  0.009  0.010  0.011  0.013  0.016  0.021
 500  2   1.5   2      0.014        0.089  0.024  0.014  0.013  0.018  0.023  0.035  0.043
 500  2   1.5  2.5     0.008        0.042  0.010  0.008  0.010  0.014  0.021  0.030  0.043
 500  2   2.0   2      0.017        0.083  0.023  0.015  0.017  0.025  0.043  0.079  0.118
 500  2   2.0  2.5     0.009        0.041  0.009  0.009  0.015  0.026  0.041  0.068  0.097

Table 2: Comparison of MISE in the discrete $X$ case (same layout).

  n   σ    α    β   MISE($\hat b$)   m=1    m=2    m=3    m=4    m=5    m=6    m=7    m=8
 100  1   1.1   2      0.036        0.131  0.059  0.042  0.036  0.035  0.035  0.038  0.043
 100  1   1.1  2.5     0.030        0.088  0.040  0.031  0.031  0.033  0.034  0.038  0.042
 100  1   1.5   2      0.032        0.097  0.040  0.031  0.029  0.033  0.042  0.053  0.063
 100  1   1.5  2.5     0.021        0.057  0.024  0.020  0.021  0.028  0.037  0.045  0.055
 100  1   2.0   2      0.030        0.101  0.034  0.026  0.028  0.036  0.048  0.065  0.086
 100  1   2.0  2.5     0.017        0.048  0.017  0.018  0.028  0.039  0.052  0.074  0.100
 100  2   1.1   2      0.052        0.124  0.064  0.053  0.054  0.064  0.075  0.087  0.108
 100  2   1.1  2.5     0.044        0.091  0.050  0.043  0.047  0.054  0.068  0.086  0.111
 100  2   1.5   2      0.048        0.099  0.046  0.043  0.056  0.077  0.109  0.139  0.187
 100  2   1.5  2.5     0.036        0.061  0.032  0.039  0.050  0.070  0.096  0.133  0.173
 100  2   2.0   2      0.046        0.102  0.042  0.046  0.074  0.116  0.189  0.257  0.350
 100  2   2.0  2.5     0.031        0.053  0.029  0.043  0.072  0.122  0.180  0.250  0.335
 500  1   1.1   2      0.008        0.094  0.027  0.014  0.010  0.009  0.008  0.009  0.010
 500  1   1.1  2.5     0.006        0.044  0.013  0.008  0.007  0.006  0.007  0.007  0.008
 500  1   1.5   2      0.009        0.090  0.026  0.013  0.009  0.009  0.009  0.010  0.012
 500  1   1.5  2.5     0.005        0.043  0.010  0.006  0.005  0.006  0.007  0.009  0.011
 500  1   2.0   2      0.012        0.086  0.023  0.011  0.010  0.010  0.013  0.018  0.022
 500  1   2.0  2.5     0.006        0.040  0.009  0.005  0.006  0.009  0.012  0.017  0.022
 500  2   1.1   2      0.015        0.090  0.028  0.017  0.014  0.015  0.017  0.020  0.024
 500  2   1.1  2.5     0.010        0.047  0.014  0.010  0.010  0.012  0.015  0.019  0.023
 500  2   1.5   2      0.016        0.089  0.025  0.016  0.014  0.017  0.022  0.029  0.037
 500  2   1.5  2.5     0.009        0.043  0.011  0.009  0.011  0.017  0.024  0.032  0.040
 500  2   2.0   2      0.018        0.085  0.025  0.017  0.019  0.029  0.041  0.058  0.076
 500  2   2.0  2.5     0.010        0.041  0.010  0.010  0.014  0.021  0.031  0.046  0.063

5 Proofs

We shall only prove Theorem 1; the proof of Theorem 2 is similar and thus omitted. Before we present the proof of the main result, we first collect a few technical lemmas, which will be proved in Section 5.3. We sharpen some results in Hall and Horowitz (2007) and give a risk bound for a blockwise James-Stein estimator. In this section we denote by $C$ a generic constant which may vary from place to place.

5.1 Technical lemmas

It was proposed in Hall and Horowitz (2007) to estimate $b$ by $\sum_{j=1}^m \hat b_j\hat\varphi_j$ with a choice of cutoff $m = n^{\frac{1}{\alpha+2\beta}}$ to obtain the minimax rate of convergence. The lemma below explains why there is no need ever to go beyond the $\hat m_*$-th term in defining the block thresholding procedure (16).

Lemma 1 Let $\gamma$ and $\gamma_1$ be constants satisfying $\frac{1}{\alpha+2\beta} < \gamma < \frac{1}{3\alpha} < \gamma_1$. For all $D > 0$, there exists a constant $C_D$ such that
$$P\big(n^\gamma \le \hat m_* \le n^{\gamma_1}\big) \ge 1 - C_D n^{-D},$$
where $\hat m_*$ is defined in (13).

In this section we set
$$\frac{1}{\alpha+2\beta} < \gamma < \min\Big\{\frac{1+\varepsilon}{\alpha+2\beta},\ \frac{1}{3\alpha}\Big\}, \qquad \frac{1}{3\alpha} < \gamma_1 < \frac{1}{2(\alpha+1)} \qquad (27)$$
for a small $0 < \varepsilon < \min\big\{\frac{\alpha-2}{3}, \frac{2\beta-\alpha}{3\alpha+1}\big\}$. We give upper bounds on the approximation of the eigenfunction $\varphi_j$ by the empirical eigenfunction $\hat\varphi_j$ for $j \le n^{\gamma_1}$.

Lemma 2 For all $j \le n^{\gamma_1}$, we have
$$nE\big\|\hat\varphi_j - \varphi_j\big\|^2 \le Cj^2,$$


and for any given $0 < \delta < 1$ and all $D > 0$ there exists a constant $C_D > 0$ such that
$$P\big\{n^{1-\delta}\big\|\hat\varphi_j - \varphi_j\big\|^2 \ge Cj^2\big\} \le C_D n^{-D}.$$

Lemma 3 gives a variance bound for $\hat b_j$, which helps us show that the variance of $\hat d_j$ is approximately $\frac{\sigma^2}{n}$. This result is crucial for proposing a practical block thresholding procedure.

Lemma 3 For $j \le n^{\gamma_1}$ with $\gamma_1 < \frac{1}{2(\alpha+1)}$,
$$E\big(\bar b_j - b_j\big)^2 \le Cj^2/n.$$
In particular, this implies $\mathrm{Var}(\bar b_j) \le Cj^2/n$ and $\mathrm{Var}(\hat b_j) = \sigma^2\theta_j^{-1} n^{-1}(1 + o(1))$.

The following lemma gives bounds for the variance and mean squared error of $\hat d_j$.

Lemma 4 For $j \le n^{\gamma_1}$ with $\gamma_1 < \frac{1}{2(\alpha+1)}$,
$$\mathrm{Var}(\hat d_j) = \frac{\sigma^2}{n}(1 + o(1)) \quad\text{and}\quad E\Big(\hat d_j - \theta_j^{\frac12} b_j\Big)^2 \le Cn^{-1}j^{2-\alpha}.$$

The following two lemmas will be used to analyze the factor $\rho_j$ in equation (16).

Lemma 5 Let $n^\gamma \le m_1 \le m_2 \le n^{\gamma_1}$ and $m_2 - m_1 \ge n^\delta$ for some $\delta > 0$. Define $S^2 = \sum_{j=m_1}^{m_2}\hat d_j^2$. For any given $\varepsilon > 0$ and all $D > 0$ there exists a constant $C_D > 0$ such that
$$P\Big(S^2 > (1 + \varepsilon)(m_2 - m_1)\frac{\sigma^2}{n}\Big) \le C_D n^{-D}.$$

Lemma 6 Let $\hat d_j = d'_j + \varepsilon_j$ where $d'_j = E(\hat d_j)$. Let $\varepsilon > 0$ be a fixed constant. If the block size $L_i = \mathrm{Card}(B_i) \ge n^\delta$ for some $\delta > 0$, then for any $D > 0$ there exists a constant $C_D > 0$ such that
$$P\Big(\sum_{j\in B_i}\varepsilon_j^2 > (1 + \varepsilon)L_i\frac{\sigma^2}{n}\Big) \le C_D n^{-D}. \qquad (28)$$
Moreover, for all blocks $B_i$,
$$E\sum_{j\in B_i}\varepsilon_j^2 \le CL_i\frac{\sigma^2}{n}. \qquad (29)$$

Conventional oracle inequalities were derived for Gaussian errors. In the current setting the errors are non-Gaussian. The following lemma gives an oracle inequality for a block thresholding estimator in the case of general error distributions; see Brown, Cai, Zhang, Zhao and Zhou (2010) for a proof.


Lemma 7 Suppose $y_i = \theta_i + \varepsilon_i$, $i = 1, \dots, L$, where the $\theta_i$ are constants and the $\varepsilon_i$ are random variables. Let $S^2 = \sum_{i=1}^L y_i^2$ and let
$$\hat\theta_i = \Big(1 - \frac{\lambda L}{S^2}\Big)_+ y_i.$$
Then
$$E\|\hat\theta - \theta\|_2^2 \le \min\big\{\|\theta\|_2^2,\ 4\lambda L\big\} + 4E\|\varepsilon\|_2^2\, I\big(\|\varepsilon\|_2^2 > \lambda L\big). \qquad (30)$$

5.2 Proof of Theorem 1

We shall prove Theorem 1 for a general block thresholding estimator with the shrinkage factor $\rho_j = \Big(1 - \frac{\lambda L_i\sigma^2}{nS_i^2}\Big)_+$ for a constant $\lambda > 1$. Let $\gamma$ and $\gamma_1$ be constants satisfying
$$\frac{1}{\alpha+2\beta} < \gamma < \min\Big\{\frac{1+\varepsilon}{\alpha+2\beta},\ \frac{1}{3\alpha}\Big\} \le \frac{1}{3\alpha} < \gamma_1 < \frac{1}{2(\alpha+1)}$$
for a small $\varepsilon > 0$. Let $m_* = n^\gamma$ and write $\hat b$ as
$$\hat b(u) = \sum_{j=1}^{m_*}\rho_j\hat b_j\hat\varphi_j(u) + \sum_{j=m_*+1}^{n}\rho_j\hat b_j\hat\varphi_j(u). \qquad (31)$$

We shall show that $E\|\hat b - b\|_2^2 \le Cn^{-\frac{2\beta-1}{\alpha+2\beta}}$. Note that
$$E\|\hat b - b\|_2^2 = E\Big\|\sum_{j=1}^{m_*}\tilde b_j\hat\varphi_j(u) + \sum_{j=m_*+1}^{n}\tilde b_j\hat\varphi_j(u) - \sum_{j=1}^{m_*}b_j\varphi_j(u) - \sum_{j=m_*+1}^{\infty}b_j\varphi_j(u)\Big\|_2^2$$
$$\le 3E\Big\|\sum_{j=1}^{m_*}\tilde b_j\hat\varphi_j(u) - \sum_{j=1}^{m_*}b_j\varphi_j(u)\Big\|_2^2 + 3\sum_{j=m_*+1}^{n}E(\tilde b_j^2) + 3\sum_{j=m_*+1}^{\infty}b_j^2. \qquad (32)$$
The last term in (32) is bounded by $Cn^{-\gamma(2\beta-1)} = o\big(n^{-(2\beta-1)/(\alpha+2\beta)}\big)$ since $\gamma > \frac{1}{\alpha+2\beta}$. We first show that the second term in (32) is small as well. Let $m^* = n^{\gamma_1}$ and let $i_*$ and $i^*$ be the block indices corresponding to the $(m_*+1)$-st and the $m^*$-th term respectively. (That is, $\hat b_{m_*+1}$ is in the $i_*$-th block and $\hat b_{m^*}$ is in the $i^*$-th block.) Then it follows from Lemmas 1 and 5 that

$$\sum_{j=m_*+1}^{n}E(\tilde b_j^2) = \Big(\sum_{j=m_*+1}^{m^*} + \sum_{j=m^*+1}^{n}\Big)E(\rho_j^2\hat b_j^2) \le \sum_{j=m_*+1}^{m^*}\big(E\rho_j^4\big)^{\frac12}\big(E\hat b_j^4\big)^{\frac12} + \sum_{j=m^*+1}^{n}\big(E\hat b_j^4\big)^{\frac12}P^{\frac12}\big(\hat m_* \ge n^{\gamma_1}+1\big)$$
$$\le \sum_{i=i_*}^{i^*}\big[P\big(S_i^2 > \lambda L_i\sigma^2/n\big)\big]^{1/2}\sum_{j\in B_i}\big(E\hat b_j^4\big)^{\frac12} + \sum_{j=m^*+1}^{n}\big(E\hat b_j^4\big)^{\frac12}\big[P\big(\hat m_* \ge n^{\gamma_1}+1\big)\big]^{1/2} = o\big(n^{-\frac{2\beta-1}{\alpha+2\beta}}\big).$$

We now turn to the first and dominant term in (32). The Cauchy-Schwarz inequality yields
$$E\Big\|\sum_{j=1}^{m_*}\tilde b_j\hat\varphi_j(u) - \sum_{j=1}^{m_*}b_j\varphi_j(u)\Big\|_2^2 \le 2E\Big\|\sum_{j=1}^{m_*}(\tilde b_j - b_j)\hat\varphi_j(u)\Big\|_2^2 + 2E\Big\|\sum_{j=1}^{m_*}b_j\big(\hat\varphi_j(u) - \varphi_j(u)\big)\Big\|_2^2$$
$$\le 2\sum_{j=1}^{m_*}E(\tilde b_j - b_j)^2 + 2m_*\sum_{j=1}^{m_*}b_j^2\, E\big\|\hat\varphi_j(u) - \varphi_j(u)\big\|_2^2.$$

Lemma 2 implies that the second term in the equation above is bounded by
$$C\frac{m_*}{n}\sum_{j=1}^{m_*}b_j^2 j^2 = O\big(n^{\gamma-1}\big) = o\big(n^{-\frac{2\beta-1}{\alpha+2\beta}}\big),$$
since $\sum_{j=1}^{m_*}b_j^2 j^2$ is finite and $\gamma < \frac{\alpha+1}{\alpha+2\beta}$, which implies $\gamma - 1 < -\frac{2\beta-1}{\alpha+2\beta}$. Set $d'_j = E(\hat d_j)$ and let $\kappa_i$ be the smallest eigenvalue in the block $B_i$. Then
$$\sum_{j=1}^{m_*}E(\tilde b_j - b_j)^2 = \sum_{j=1}^{m_*}E\big(\hat\theta_j^{-\frac12}\tilde d_j - \theta_j^{-\frac12}d_j\big)^2 \le 2\sum_{j=1}^{m_*}\theta_j^{-1}E(\tilde d_j - d_j)^2 + 2\sum_{j=1}^{m_*}E\Big[\tilde d_j^2\big(\hat\theta_j^{-\frac12} - \theta_j^{-\frac12}\big)^2\Big]$$
$$\le 2\sum_{i=1}^{i^*}\kappa_i^{-1}\sum_{j\in B_i}E(\tilde d_j - d'_j)^2 + 2\sum_{i=1}^{i^*}\kappa_i^{-1}\sum_{j\in B_i}(d'_j - d_j)^2 + 2\sum_{j=1}^{m_*}E\Big[\tilde d_j^2\big(\hat\theta_j^{-\frac12} - \theta_j^{-\frac12}\big)^2\Big] \equiv T_1 + T_2 + T_3.$$
From equations (34) and (35) and Lemma 4, it is easy to see that
$$T_3 \le C\sum_{j=1}^{m_*}E\big\{\tilde d_j^2\theta_j^{-3}\big(\hat\theta_j - \theta_j\big)^2\big\} = o\big(n^{-\frac{2\beta-1}{\alpha+2\beta}}\big).$$

We now turn to the dominant term $T_1 + T_2$. This term is most closely related to the block thresholding rule, and we need to show that $T_1 + T_2 \le Cn^{-\frac{2\beta-1}{\alpha+2\beta}}$. To bound $T_1$, it is necessary to analyze the risk of the block thresholding rule for a single block $B_i$. It follows from Lemma 7 that
$$\sum_{j\in B_i}E(\tilde d_j - d'_j)^2 \le \min\Big\{4\lambda L_i\sigma^2/n,\ \sum_{j\in B_i}(d'_j)^2\Big\} + 4E\Big\{\Big(\sum_{j\in B_i}\varepsilon_j^2\Big)\cdot I\Big(\sum_{j\in B_i}\varepsilon_j^2 > \lambda L_i\sigma^2/n\Big)\Big\}, \qquad (33)$$
where $\lambda > 1$ is a constant. Lemma 4 implies
$$\big(d'_j - \theta_j^{\frac12}b_j\big)^2 \le Cn^{-1}j^{2-\alpha}.$$
Note that for all $j$ in $B_i$ we have $\theta_j^{-1} \asymp \kappa_i^{-1}$. Hence for $m_* = n^\gamma$ with $\gamma < \frac{1+\varepsilon}{\alpha+2\beta}$ we have
$$T_2 \le C\sum_{j=1}^{m_*}\theta_j^{-1}n^{-1}j^{2-\alpha} \le C\frac{1}{n}\big(1 + m_*^3\big) = o\big(n^{-\frac{2\beta-1}{\alpha+2\beta}}\big).$$
Let $m = n^{\frac{1}{\alpha+2\beta}}$; then equation (33) and Lemma 6 give
$$T_1 \le C\sum_{j=1}^{m}\frac{\theta_j^{-1}}{n} + C\sum_{j=m+1}^{m_*}\Big[\theta_j^{-1}\cdot\big(\theta_j^{\frac12}b_j\big)^2 + \theta_j^{-1}n^{-1}j^{2-\alpha}\Big] + C/n \le C_1 n^{-\frac{2\beta-1}{\alpha+2\beta}}.$$
These together imply $E\|\hat b - b\|_2^2 \le Cn^{-\frac{2\beta-1}{\alpha+2\beta}}$.

5.3 Proof of auxiliary lemmas

Let $\Delta^2 = \big\|\hat K - K\big\|^2 = \int\!\!\int\big(\hat K(u,v) - K(u,v)\big)^2\,du\,dv$ and $\tau_j = \min_{k\le j}(\theta_k - \theta_{k+1})$. It is known from Bhatia, Davis and McIntosh (1983) that
$$\sup_j\big|\hat\theta_j - \theta_j\big| \le \Delta, \qquad \sup_{j\ge1}\tau_j\big\|\hat\varphi_j - \varphi_j\big\| \le 8^{1/2}\Delta. \qquad (34)$$
For $\varepsilon > 0$, it was shown in Hall and Hosseini-Nasab (2006, Lemma 3.3) that
$$P\big(\Delta > n^{\varepsilon-1/2}\big) \le C_D n^{-D} \qquad (35)$$
for each $D > 0$ under the assumption (19).

It is useful to rewrite $\hat b_j$ as
$$\hat b_j = \hat\theta_j^{-1}\hat g_j = \hat\theta_j^{-1}\int\frac{1}{n}\sum_{i=1}^n(Y_i - \bar Y)\{X_i(u) - \bar X(u)\}\hat\varphi_j(u) = \hat\theta_j^{-1}\int\frac{1}{n}\sum_{i=1}^n\big(\big\langle X_i - \bar X, b\big\rangle + Z_i - \bar Z\big)\{X_i(u) - \bar X(u)\}\hat\varphi_j(u)$$
$$= \bar b_j + \hat\theta_j^{-1}\frac{1}{n}\int(\mathbf X - \bar X)'\hat\varphi_j\cdot(\mathbf Z - \bar Z) = \bar b_j + \hat\theta_j^{-1}\frac{1}{n}\hat x'_{\cdot,j}(\mathbf Z - \bar Z).$$
Using the fact that $\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X)) + \mathrm{Var}(E(Y|X))$ for any two random variables $X$ and $Y$, and the fact that $Z$ has mean zero and is independent of $X$, we have
$$\mathrm{Var}(\hat b_j) = \mathrm{Var}(\bar b_j) + \frac{\sigma^2}{n^2}\sum_{i=1}^nE\big(\hat\theta_j^{-2}\hat x_{i,j}^2\big) = \mathrm{Var}(\bar b_j) + \frac{\sigma^2}{n}E\hat\theta_j^{-1}.$$


5.3.1 Proof of Lemma 1

Recall that $\hat m_* = \arg\min\{m : \hat\theta_m/\hat\theta_1 \le n^{-1/3}\}$. Note that $\theta_j \ge M_0^{-1}j^{-\alpha}$. Since $\gamma$ satisfies $\frac{1}{\alpha+2\beta} < \gamma < \frac{1}{3\alpha}$, for $m \le n^\gamma$ we have $\theta_m \ge M_0^{-1}n^{-\alpha\gamma}$. Since $\alpha\gamma < 1/3$, equations (34) and (35) imply that for any $D > 0$ there exists a constant $C_D > 0$ such that
$$P\Big(\cup_{m=1}^{n^\gamma}\big\{\hat\theta_m/\hat\theta_1 \le n^{-1/3}\big\}\Big) \le C_D n^{-D}$$
and hence
$$P\big(\hat m_* \le n^\gamma\big) \le C_D n^{-D}, \quad\text{i.e.,}\quad P\big(\hat m_* \ge n^\gamma\big) \ge 1 - C_D n^{-D}.$$
Similarly, for $m \ge n^{\gamma_1}$ we have $\theta_m \le M_0 n^{-\gamma_1\alpha}$ with $\alpha\gamma_1 > 1/3$, so
$$P\Big(\cup_{n\ge m\ge n^{\gamma_1}}\big\{\hat\theta_m/\hat\theta_1 > n^{-1/3}\big\}\Big) \le C_D n^{-D}$$
and hence
$$P\big(\hat m_* > n^{\gamma_1}\big) \le C_D n^{-D}, \quad\text{i.e.,}\quad P\big(\hat m_* \le n^{\gamma_1}\big) \ge 1 - C_D n^{-D}.$$
Thus we have
$$P\big(n^\gamma \le \hat m_* \le n^{\gamma_1}\big) \ge 1 - C_D n^{-D}.$$

5.3.2 Proof of Lemma 2

Let $F_j = \big\{\frac12|\theta_j - \theta_k| \le \big|\hat\theta_j - \theta_k\big| \le 2|\theta_j - \theta_k|,\ k \ne j\big\}$ for $j \le n^{\gamma_1}$. From the assumption (18) we have $|\theta_j - \theta_k| \ge M_0^{-1}n^{-(\alpha+1)\gamma_1}$ with $(\alpha+1)\gamma_1 < \frac12$. Then equations (34) and (35) imply that for any $D > 0$ there exists a constant $C_D > 0$ such that for $j \le n^{\gamma_1}$
$$P\big(F_j^c\big) \le C_D n^{-D} \qquad (36)$$
and consequently
$$P\Big(\cap_{j\le n^{\gamma_1},\,k\ne j}\Big\{\frac12|\theta_j - \theta_k| \le \big|\hat\theta_j - \theta_k\big| \le 2|\theta_j - \theta_k|\Big\}\Big) \ge 1 - C_Dn^{-D}. \qquad (37)$$
Note that
$$\hat\varphi_j - \varphi_j = \sum_k\varphi_k\int\big(\hat\varphi_j - \varphi_j\big)\varphi_k = \sum_{k:k\ne j}\varphi_k\int\hat\varphi_j\varphi_k + \varphi_j\Big(\int\hat\varphi_j\varphi_j - 1\Big).$$
The facts $\int \hat K(u,v)\hat\varphi_j(u)\,du = \hat\theta_j\hat\varphi_j(v)$ and $\int K(u,v)\varphi_k(v)\,dv = \theta_k\varphi_k(u)$ imply
$$\int\hat\varphi_j\varphi_k = \big(\hat\theta_j - \theta_k\big)^{-1}\int\!\!\int\big[\hat K(u,v) - K(u,v)\big]\hat\varphi_j(u)\varphi_k(v)\,du\,dv.$$

19

Now it follows from the elementary inequality $1 - x \le \sqrt{1-x} \le 1 - x/2$ for $0 \le x \le 1$ (we assume without loss of generality that $\int\hat\varphi_j\varphi_j \ge 0$) that
$$1 - \sum_{k\ne j}\Big[\int\hat\varphi_j\varphi_k\Big]^2 \le \int\hat\varphi_j\varphi_j = \sqrt{1 - \sum_{k\ne j}\Big[\int\hat\varphi_j\varphi_k\Big]^2} \le 1 - \frac12\sum_{k\ne j}\Big[\int\hat\varphi_j\varphi_k\Big]^2.$$

Then we have
$$\big\|\hat\varphi_j - \varphi_j\big\|^2 \le 2\sum_{k:k\ne j}\Big[\big(\hat\theta_j - \theta_k\big)^{-1}\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\hat\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2,$$
which on $F_j$ is further bounded by
$$8\sum_{k:k\ne j}\Big[(\theta_j - \theta_k)^{-1}\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\hat\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2$$
$$\le 16\sum_{k:k\ne j}(\theta_j - \theta_k)^{-2}\Big\{\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\big(\hat\varphi_j(u) - \varphi_j(u)\big)\varphi_k(v)\,du\,dv\Big]^2 + \Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2\Big\}$$
$$\le Cn^{2\gamma_1(\alpha+1)}\Delta^2\big\|\hat\varphi_j - \varphi_j\big\|^2 + 16\sum_{k:k\ne j}(\theta_j - \theta_k)^{-2}\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2.$$
This implies for each $D > 0$
$$P\Big(\frac12\big\|\hat\varphi_j - \varphi_j\big\|^2 \le 16\sum_{k:k\ne j}(\theta_j - \theta_k)^{-2}\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2\Big) \ge 1 - C_Dn^{-D}.$$

Let $\eta_{i,j} = \int X_i\varphi_j$ and $\bar\eta_j = \frac1n\sum_i\eta_{i,j}$; then
$$X_i - \bar X = \sum_{j=1}^\infty\big(\eta_{i,j} - \bar\eta_j\big)\varphi_j.$$
Assume without loss of generality that $EX = 0$, and for $k \ne j$ write
$$\int\!\!\int\big[\hat K(u,v) - K(u,v)\big]\varphi_j(u)\varphi_k(v)\,du\,dv = \frac1n\sum_{i=1}^n\big(\eta_{i,j} - \bar\eta_j\big)\big(\eta_{i,k} - \bar\eta_k\big) = \frac1n\sum_{i=1}^n\eta_{i,j}\eta_{i,k} - \bar\eta_k\bar\eta_j,$$
where $\frac1n\sum_{i=1}^n\eta_{i,j}\eta_{i,k}$ is the dominating term. From the assumption (20) we have
$$E\Big(\frac1n\sum_{i=1}^n\eta_{i,j}\eta_{i,k}\Big)^2 \le n^{-1}E\big(\eta_{1,j}\eta_{1,k}\big)^2 \le n^{-1}\big[E\eta_{1,j}^4\,E\eta_{1,k}^4\big]^{1/2} \le C_1n^{-1}\theta_j\theta_k.$$


Note that the spacing condition in (18) implies $\theta_m - \theta_{2m} \asymp m^{-\alpha}$, so we have
$$E\big\|\hat\varphi_j - \varphi_j\big\|^2 \le C\sum_{k:k\ne j}(\theta_j - \theta_k)^{-2}\, n^{-1}\theta_j\theta_k \le Cn^{-1}\theta_j\Big[j^{2\alpha}\sum_{k:k\ge 2j}k^{-\alpha} + \sum_{k:k\le j/2}k^{\alpha} + j^{2(\alpha+1)}\sum_{k:2j\ge k\ge j/2}\frac{k^{-\alpha}}{(1 + |j - k|)^2}\Big] \le C_1 n^{-1}j^2 \qquad (38)$$
and the first part of the lemma is proved.

For the second part of the lemma, equation (38) implies that it suffices to show that for $j \le n^{\gamma_1}$ and all $\delta > 0$
$$P\Big(\cup_k\Big\{n^{1-\delta}k^{\alpha}j^{\alpha}\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2 \ge 1\Big\}\Big) \le C_Dn^{-D}. \qquad (39)$$

For a large constant $q > 0$, we have
$$E\sum_{k>n^q}(\theta_j - \theta_k)^{-2}\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\hat\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2 \le CE\frac{\theta_j^{-2}}{n^2}\sum_{k>n^q}\Big(\sum_{i=1}^n\eta_{i,j}\eta_{i,k}\Big)^2 \le C_1\theta_j^{-1}n^{-1}\sum_{k>n^q}\theta_k \le C_q\,\theta_j^{-1}n^{-1}n^{-q(\alpha-1)},$$
which can be made smaller than $n^{-D}$ by setting $q$ sufficiently large. It follows from the Markov inequality that
$$P\Big(\cup_{k>n^q}\Big\{n^{1-\delta}k^{\alpha}j^{\alpha}\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2 \ge 1\Big\}\Big) \le C_Dn^{-D}.$$

We need now only consider $k \le n^q$. Let $w$ be a positive integer. Then
$$E\Big(\frac1n\sum_{i=1}^n\eta_{i,j}\eta_{i,k}\Big)^{2w} \le Cn^{-w}E\big(\eta_{1,j}\eta_{1,k}\big)^{2w} \le Cn^{-w}\big[E\eta_{1,j}^{4w}\,E\eta_{1,k}^{4w}\big]^{1/2} \le C_1n^{-w}\theta_j^w\theta_k^w,$$
where the last inequality follows from (20). The Markov inequality yields that for every integer $k > 0$
$$P\Big\{n^{1-\delta}k^{\alpha}j^{\alpha}\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2 \ge 1\Big\} \le C_2n^{-w\delta}.$$
By choosing $w$ sufficiently large, this implies
$$P\Big(\cup_{k\le n^q}\Big\{n^{1-\delta}k^{\alpha}j^{\alpha}\Big[\int\!\!\int\big(\hat K(u,v) - K(u,v)\big)\varphi_j(u)\varphi_k(v)\,du\,dv\Big]^2 \ge 1\Big\}\Big) \le C_Dn^{-D}.$$
Equation (39) is then proved, and so is the second part of the lemma.


5.3.3 Proof of Lemmas 3 and 4

Since $\mathrm{Var}(\bar b_j) \le E\big(\int b\hat\varphi_j - \int b\varphi_j\big)^2$, we analyze $\int b\hat\varphi_j - \int b\varphi_j = \int b\big(\hat\varphi_j - \varphi_j\big)$ instead. By the Cauchy-Schwarz inequality we have
$$E\Big[\int b\big(\hat\varphi_j - \varphi_j\big)\Big]^2 \le CE\big\|\hat\varphi_j - \varphi_j\big\|^2 \le C_1j^2/n = o\Big(\frac{j^\alpha}{n}\Big). \qquad (40)$$

We need to analyze $\hat d_j = \hat\theta_j^{-\frac12}\hat g_j$. It follows from (12) that
$$\hat d_j = \hat\theta_j^{-\frac12}\hat g_j = \hat\theta_j^{\frac12}\bar b_j + \hat\theta_j^{-\frac12}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z).$$
Hence $E(\hat d_j) = E\big(\hat\theta_j^{\frac12}\bar b_j\big)$. As before, it follows from the fact $\mathrm{Var}(Y) = E(\mathrm{Var}(Y|X)) + \mathrm{Var}(E(Y|X))$ for any two random variables $X$ and $Y$ that
$$\mathrm{Var}(\hat d_j) = \mathrm{Var}\big(\hat\theta_j^{\frac12}\bar b_j\big) + \frac{\sigma^2}{n^2}\sum_{i=1}^nE\big(\hat\theta_j^{-1}\hat x_{i,j}^2\big) = \mathrm{Var}\big(\hat\theta_j^{\frac12}\bar b_j\big) + \frac{\sigma^2}{n}.$$

We need to bound $\mathrm{Var}\big(\hat\theta_j^{\frac12}\bar b_j\big)$. Note that
$$\mathrm{Var}\big(\hat\theta_j^{\frac12}\bar b_j\big) \le E\big(\hat\theta_j^{\frac12}\bar b_j - \theta_j^{1/2}b_j\big)^2 \le 2E\big(\hat\theta_j^{\frac12} - \theta_j^{1/2}\big)^2\bar b_j^2 + 2\theta_jE\big(\bar b_j - b_j\big)^2$$
$$\le 2E\big(\hat\theta_j^{\frac12} - \theta_j^{1/2}\big)^2\bar b_j^2 + Cn^{-1}j^{2-\alpha} \le 2E\Big(\frac{\hat\theta_j - \theta_j}{\theta_j^{1/2}}\Big)^2\bar b_j^2 + Cn^{-1}j^{2-\alpha} \le Cn^{-1}j^{-2\beta+\alpha} + Cn^{-1}j^{2-\alpha} \le C_1n^{-1}j^{2-\alpha}. \qquad (41)$$
Here the third inequality follows from (40).

5.3.4 Proof of Lemma 5

Recall that
$$\hat d_j = \hat\theta_j^{-1/2}\hat g_j = \hat\theta_j^{1/2}\bar b_j + \hat\theta_j^{-1/2}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z).$$
The second term is dominant, and we consider it first. Since $\frac1n\sum_{i=1}^n\hat x_{i,j}\hat x_{i,k} = \hat\theta_j\delta_{j,k}$, we have
$$\sum_{j=m_1}^{m_2}\Big[\hat\theta_j^{-1/2}\frac1n\hat x'_{\cdot,j}\mathbf Z\Big]^2 \sim \frac{\sigma^2}{n}\chi^2_{m_2-m_1+1}.$$

So for any $D > 0$ there exists a constant $C_D > 0$ such that
$$P\Big(\sum_{j=m_1}^{m_2}\hat\theta_j^{-1}\Big[\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z)\Big]^2 > (1 + \varepsilon)(m_2 - m_1)\frac{\sigma^2}{n}\Big) \le C_Dn^{-D}. \qquad (42)$$

Now we turn to the first term. It is easy to see that
$$\sum_{j=m_1}^{m_2}\theta_jb_j^2 \le \varepsilon\,\frac{m_2 - m_1}{n},$$
and for any $D > 0$
$$P\big(\big|\hat\theta_j - \theta_j\big| \ge \varepsilon\theta_j \text{ for some } j \le n^{\gamma_1}\big) \le C_Dn^{-D}.$$
We need only show that for any $D > 0$
$$P\Big(\sum_{j=m_1}^{m_2}\theta_j\Big[\int b\big(\hat\varphi_j - \varphi_j\big)\Big]^2 > \varepsilon(m_2 - m_1)\frac{\sigma^2}{n}\Big) \le C_Dn^{-D}.$$
By the Cauchy-Schwarz inequality it suffices to show that for any $D > 0$
$$P\Big(\theta_j\int\big(\hat\varphi_j - \varphi_j\big)^2 > \varepsilon\,\frac{\sigma^2}{n}\Big) \le C_Dn^{-D}. \qquad (43)$$
This follows directly from Lemma 2.

5.3.5 Proof of Lemma 6

We write
$$\sum_{j\in B_i}\varepsilon_j^2 = \sum_{j\in B_i}\big(\hat d_j - d'_j\big)^2 = \sum_{j\in B_i}\Big[\hat\theta_j^{\frac12}\bar b_j - d'_j + \hat\theta_j^{-\frac12}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z)\Big]^2$$
$$= \sum_{j\in B_i}\big(\hat\theta_j^{\frac12}\bar b_j - d'_j\big)^2 + 2\sum_{j\in B_i}\big(\hat\theta_j^{\frac12}\bar b_j - d'_j\big)\hat\theta_j^{-\frac12}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z) + \sum_{j\in B_i}\Big[\hat\theta_j^{-\frac12}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z)\Big]^2$$
$$\le \sum_{j\in B_i}\big(\hat\theta_j^{\frac12}\bar b_j - d'_j\big)^2 + 2\Big\{\sum_{j\in B_i}\big(\hat\theta_j^{\frac12}\bar b_j - d'_j\big)^2\sum_{j\in B_i}\Big[\hat\theta_j^{-\frac12}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z)\Big]^2\Big\}^{1/2} + \sum_{j\in B_i}\Big[\hat\theta_j^{-\frac12}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z)\Big]^2.$$

We first show equation (28). From equation (42) it suffices to prove that, when $\lambda = 1 + \varepsilon$ and $L_i \equiv |B_i| \ge n^\delta$ for some $\delta > 0$,
$$P\Big\{\sum_{j\in B_i}\big(\hat\theta_j^{\frac12}\bar b_j - d'_j\big)^2 > \frac{\varepsilon}{3}L_i\frac{\sigma^2}{n}\Big\} \le C_Dn^{-D}$$
for any $D > 0$, where $C_D > 0$ is a constant. Note that, when $j \le n^{\gamma_1}$, for any $D > 0$ there exists a constant $C_D > 0$ such that
$$P\Big(\big|\hat\theta_j - \theta_j\big| \ge \frac{\varepsilon}{2}\theta_j\Big) \le C_Dn^{-D}$$
and
$$E\big(\hat\theta_j^{\frac12}\bar b_j - d'_j\big)^2 = o\Big(\frac1n\Big) \quad\text{as } j \to \infty.$$
It then suffices to show that for all $D > 0$
$$P\Big(\sum_{j\in B_i}\theta_j\Big[\int b\big(\hat\varphi_j - \varphi_j\big)\Big]^2 > \varepsilon L_i\frac{\sigma^2}{n}\Big) \le C_Dn^{-D}.$$
This follows from arguments similar to those in the proof of Lemma 5, with $L_i \ge n^\delta$ for some $\delta > 0$.

Equation (29) follows easily from the fact
$$E\sum_{j\in B_i}\varepsilon_j^2 = E\sum_{j\in B_i}\big(\hat\theta_j^{\frac12}\bar b_j - d'_j\big)^2 + E\sum_{j\in B_i}\Big[\hat\theta_j^{-\frac12}\frac1n\hat x'_{\cdot,j}(\mathbf Z - \bar Z)\Big]^2,$$
where the first term is bounded by $\frac{C}{n}L_i$ by equation (41) and the second term is exactly $\frac{\sigma^2}{n}L_i$.

References

[1] Bhatia, R., Davis, C. and McIntosh, A. (1983). Perturbation of spectral subspaces and solution of linear operator equations. Linear Algebra Appl. 52/53, 45-67.

[2] Brown, L. D., Cai, T. T., Zhang, R., Zhao, L. H. and Zhou, H. H. (2010). The root-unroot algorithm for density estimation as implemented via wavelet block thresholding. Probability Theory and Related Fields 146, 401-433.

[3] Cai, T. T. (1999). Adaptive wavelet estimation: a block thresholding and oracle inequality approach. Ann. Statist. 27, 898-924.

[4] Cai, T. T. and Hall, P. (2006). Prediction in functional linear regression. Ann. Statist. 34, 2159-2179.

[5] Cai, T. T., Low, M. and Zhao, L. (2009). Sharp adaptive estimation by a blockwise method. Journal of Nonparametric Statistics 21, 839-850.

[6] Cardot, H. and Sarda, P. (2006). Linear regression models for functional data. In The Art of Semiparametrics, pp. 49-66. Physica-Verlag HD.

[7] Cuevas, A., Febrero, M. and Fraiman, R. (2002). Linear functional regression: the case of fixed design and functional response. Canad. J. Statist. 30, 285-300.

[8] Efromovich, S. Y. (1985). Nonparametric estimation of a density of unknown smoothness. Theory Probab. Appl. 30, 557-661.

[9] Ferraty, F. and Vieu, P. (2000). Fractal dimensionality and regression estimation in semi-normed vectorial spaces. C. R. Acad. Sci. Paris Ser. I 330, 139-142.

[10] Ferraty, F. and Vieu, P. (2006). Nonparametric Functional Data Analysis. Springer, New York.

[11] Hall, P. and Horowitz, J. L. (2007). Methodology and convergence rates for functional linear regression. Ann. Statist. 35, 70-91.

[12] Hall, P. and Hosseini-Nasab, M. (2006). On properties of functional principal components analysis. J. R. Statist. Soc. B 68, 109-126.

[13] Hall, P., Kerkyacharian, G. and Picard, D. (1998). Block threshold rules for curve estimation using kernel and wavelet methods. Ann. Statist. 26, 922-942.

[14] Li, Y. and Hsing, T. (2007). On rates of convergence in functional linear regression. J. Multivariate Analysis 98, 1782-1804.

[15] Müller, H.-G. and Stadtmüller, U. (2005). Generalized functional linear models. Ann. Statist. 33, 774-805.

[16] Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis: Methods and Case Studies. Springer, New York.

[17] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd Edition. Springer, New York.
