
Functional Linear Regression That’s Interpretable

GARETH M. JAMES∗ and JI ZHU†

Abstract

Regression models to relate a scalar Y to a functional predictor W(t) are becoming increasingly common. Work in this area has concentrated on estimating a coefficient function, β(t), with Y related to W(t) through ∫β(t)W(t)dt. Regions where β(t) ≠ 0 correspond to places where there is a relationship between W(t) and Y. Alternatively, points where β(t) = 0 indicate no relationship. Hence, for interpretation purposes, it is desirable for a regression procedure to be capable of producing estimates of β(t) that are exactly zero over certain regions and have simple structure, such as a piecewise constant or linear form, over the other regions. Unfortunately, most fitting procedures result in an estimate for β(t) that is rarely exactly zero and has unnatural wiggles making the curve hard to interpret. In this article we introduce a new approach which uses variable selection ideas, applied to various derivatives of β(t), to produce estimates that are interpretable, flexible and accurate. We call our method "Functional Linear Regression That's Interpretable" (FLRTI) and demonstrate it on simulated and real world data sets. In addition, non-asymptotic theoretical bounds on the estimation error are presented. The bounds provide strong theoretical motivation for our approach.

Some key words: Generalized Variable Selection; Functional Linear Regression; Dantzig Selector.

1 Introduction

In recent years functional data analysis (FDA) has become an increasingly important analytical tool as more data has arisen where the primary unit of observation can be viewed as a curve or, in general, a function. One of the most useful tools in FDA is that of functional regression. This setting can correspond to either functional predictors or functional responses. See Ramsay and Silverman (2002) and Muller and Stadtmuller (2005) for numerous specific applications. One commonly studied problem involves data containing functional responses. A sampling of papers examining this situation includes Fahrmeir and Tutz (1994), Liang and Zeger (1986), Faraway (1997), Hoover et al. (1998), Wu et al. (1998), Fan and Zhang (2000) and Lin and Ying (2001). However, in this paper we are primarily interested in the alternative situation where we observe a set of observations {W_i(t), Y_i} for i = 1, ..., n, where W_i(t) is a functional predictor and Y_i a real valued response. Ramsay and Silverman (2005) discuss this scenario and several papers have also been written on the topic, both for continuous or categorical responses, and for linear or non-linear models (Hastie and Mallows, 1993; James and Hastie, 2001; Ferraty and Vieu, 2002; James, 2002; Ferraty and Vieu, 2003; Muller and Stadtmuller, 2005; James and Silverman, 2005).

∗Marshall School of Business, University of Southern California
†Department of Statistics, University of Michigan


Since our primary interest here is interpretation, we will be examining the standard functional linear regression (FLR) model which relates functional predictors to a scalar response via

Y_i = β_0 + ∫ W_i(t)β(t)dt + ε_i,  i = 1, ..., n,  (1)

where β(t) is the "coefficient function". We will assume that W_i(t) is scaled such that 0 ≤ t ≤ 1. As with standard linear regression, β(t) determines the effect of W_i(t) on Y_i. For example, changes in W_i(t) have no effect on Y_i over regions where β(t) = 0. Alternatively, changes in W_i(t) have a greater effect on Y_i over regions where |β(t)| is large. Clearly, for any finite n, it would be possible to perfectly interpolate the responses if no restrictions are placed on β(t). Such restrictions generally take one of two possible forms. The first method, which we call the "basis approach", involves representing β(t) using a p-dimensional basis function, β(t) = B(t)^T η, where p is hopefully large enough to capture the patterns in β(t) but small enough to regularize the fit. With this method (1) can be reexpressed as Y_i = β_0 + X_i^T η + ε_i, where X_i = ∫ W_i(t)B(t)dt, and η can be estimated using ordinary least squares. The second method, which we call the "penalization approach", involves a penalized least squares estimation procedure to shrink variability in β(t). A common penalty is of the form ∫ β″(t)² dt, in which case one would find β(t) to minimize

∑_{i=1}^n ( Y_i − β_0 − ∫ W_i(t)β(t)dt )² + λ ∫ β″(t)² dt  (2)

for some λ > 0. As λ → ∞, (2) will force β(t) to converge to a linear function.

However, for either approach, β(t) will generally exhibit wiggles and will not be exactly linear over any region. In addition, β(t) will be exactly equal to zero at no more than a few locations even if there is no relationship between W(t) and Y for large regions of t. This is a significant disadvantage in terms of interpretation because the simpler the structure of β(t), the easier it is to understand the effect of W(t) on Y. Figure 1a) provides an illustration of this difficulty on a simulated data set. Here the "true" β(t), that we use to generate the data, is piecewise linear and exactly zero roughly between 0.25 and 0.75. The best fitting b-spline basis, with the number of knots chosen using cross-validation, is displayed in black. Even though the true β(t) curve has a simple structure, our best estimate takes on a considerably more complicated form, making interpretation difficult. Figure 1b) provides an alternative, more easily interpretable estimate using our approach. The estimate (shown in black) almost perfectly recovers the simple structure of the original coefficient curve. Figure 1c) shows the errors in estimating β(t) from the spline estimate (dashed line) and our approach (solid line).
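To make the basis approach concrete, here is a minimal sketch of the fit (the function name, the trapezoidal quadrature, and the grids are our own illustrative choices, not part of the paper): X_i = ∫W_i(t)B(t)dt is approximated on a fine grid and η is then estimated by ordinary least squares.

```python
import numpy as np

def basis_approach_fit(W, t, B, y):
    """Basis-approach FLR sketch.  W: (n, m) curves sampled at grid t (length m);
    B: (m, p) basis functions evaluated at t; y: the n responses.
    Returns (beta0_hat, eta_hat) from the OLS fit of (1) with beta(t) = B(t)^T eta."""
    # X_ik = integral of W_i(t) B_k(t) dt, approximated by the trapezoidal rule.
    X = np.trapz(W[:, :, None] * B[None, :, :], t, axis=1)   # shape (n, p)
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]
```

Here p would be kept small relative to n to regularize the fit; the penalization approach would instead add λ∫β″(t)²dt to the least squares criterion.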

The main goal of this paper is to develop a functional linear regression method that produces accurate, but also highly interpretable, estimates for the coefficient function β(t). In the process we also demonstrate that our method is computationally efficient, extremely flexible in terms of the form of the estimate, and has highly desirable theoretical properties. The key to our procedure is to reformulate the problem as a form of variable selection. We take a somewhat novel approach by first developing the concept of "Generalized Variable Selection" (GVS) where, rather than identifying the significant variables as in standard regression, we attempt to identify significant linear combinations of the variables. The GVS paradigm is applicable to many important problems but here we demonstrate it on FLR. In particular, we use GVS to identify regions where, for example, β″(t) is significantly different from zero, which allows us to automatically produce an estimate that is exactly linear except where there is clear evidence to the contrary. However, the approach

Figure 1: a) The beta curve (grey) used to produce a simulated set of fifty responses, Y_i, from corresponding predictors, W_i(t). The best fitting b-spline estimate is shown in black. b) The same beta curve and corresponding estimate using the approach from this paper. c) Estimation errors using the b-spline (dashed) and our approach (solid). [Panels plot Time against Beta in a) and b), and Time against Error in c).]

is flexible enough that we can also easily identify regions where β(t), β′(t) or higher order derivatives are non-zero, hence producing a curve that, if warranted, is exactly zero over some regions, exactly linear over others and perhaps exactly constant over yet other regions. An integral part of our approach is the method for implementing the variable selection procedure. The method needs to perform well in situations where there are many more variables than observations but most of these variables can be taken to be exactly zero. The Dantzig Selector (Candes and Tao, 2006) is a new approach, for performing standard linear regression, which was developed for exactly this scenario. It is highly computationally efficient, has very desirable theoretical properties and has shown impressive empirical performance on real world problems. To this end we adapt the methodology and theoretical results that Candes and Tao (2006) provide for standard linear regression models to the more complicated FLR setting and demonstrate that tight non-asymptotic bounds can be placed on the error in the estimated coefficient function. In addition, we are able to prove asymptotic convergence properties. We have named our method "Functional Linear Regression That's Interpretable" (FLRTI).

The paper is laid out as follows. In Section 2 we introduce the generalized variable selection paradigm. Then in Section 3 we develop the FLRTI model by drawing a connection between GVS and functional linear regression. Section 3 also introduces the Dantzig Selector and demonstrates how it can be used to produce a flexible, interpretable and accurate estimate for β(t). We then extend the FLRTI method to allow multiple derivatives to be controlled simultaneously in Section 4. This allows us, for example, by controlling the zeroth and second derivatives, to produce a β(t) curve that is exactly zero in certain sections and exactly linear in other sections. Several simulation examples are also provided. The theoretical developments for our method are presented in Section 5, where we outline both non-asymptotic bounds on the error as well as asymptotic properties of our estimate as n grows. We apply FLRTI to real world data in Section 6 and end with a discussion in Section 7.

2 Generalized Variable Selection

In general a model selection problem can be defined as the situation where one is attempting to choose a model, M, from among a class of models, 𝓜_α. In a parametric setting 𝓜_α is generally chosen to be a large class of models indexed by parameters, β = (β_1, ..., β_p). An important model selection problem involves that of "variable selection" where choosing the model M requires selecting the index set Z = {j : β_j ≠ 0}.

Consider, for example, the standard linear regression model

Y_i = β_0 + X_i^T β + ε_i = β_0 + ∑_{j=1}^p X_ij β_j + ε_i,  i = 1, ..., n,  (3)

where X_i represents a vector of p predictors, β a vector of regression coefficients, Y_i the response and ε_i an error term. Standard variable selection involves determining which β_j ≠ 0. Traditionally, most data sets have consisted of p ≪ n. In this case a common approach has been "best subset selection" where the model is fit for all possible combinations of predictors and the "best", by some measure, is chosen. Recently, situations where p is of similar size to n, or even possibly significantly larger than n, are becoming increasingly common. Some examples include functional MRI and tomography, gene expression studies where there are many genes and few observations, signal processing and curve smoothing. In such situations classical approaches, such as best subset selection, will generally fail because they are usually not computationally feasible for large p. As a result there has been a great deal of development of new model selection methods that work with large values of p. A few examples include the Non-negative Garrote (Breiman, 1995), the Lasso (Mallat and Zhang, 1993; Tibshirani, 1996; Chen et al., 1998), SCAD (Fan and Li, 2001), the Elastic Net (Zou and Hastie, 2005), and the Dantzig Selector (Candes and Tao, 2006). In order for the model selection problem to be solvable for p > n these procedures all, either implicitly or explicitly, assume sparsity in the regression coefficients.

In this paper we are interested not in the situation where β itself is sparse, but instead where sparsity is induced through a series of constraints on the coefficient vector. Our problem also involves choosing a model M ∈ 𝓜_β where 𝓜_β represents a large class of parametric models. However, in addition we compute

γ_k = f_k(β),  k = 1, ..., K,  (4)

where the f_k's represent K predefined functions. Then instead of selecting the non-zero components of β, we wish to identify the non-zero values of the γ_k's. We call this problem "Generalized Variable Selection" (GVS). Notice that standard variable selection is a special case of GVS when f_k(β) = β_k for k = 1, ..., p. In the GVS setting the parameters, β, in our final model, M, may not have any zero components, but instead the parameters are restricted through the equations f_k(β) = 0 for all k such that γ_k = 0. The GVS problem turns out to have potential applications in areas as diverse as non-linear curve estimation, principal components (both standard and functional) and transcription regulation network problems for microarray experiments. In this paper we are interested in applying GVS to the functional linear regression problem. We show in Section 3 that, for FLR, the constraints given by (4) are all linear, in which case

γ = Aβ  (5)

where A is a known K by p matrix.

The approach for solving this problem depends somewhat on whether K = p, K < p or K > p. In this paper we do not address the situation where K < p, although it is discussed briefly in Appendix E. In Section 3 we concentrate on the case where K = p. In Section 4 we cover the more general scenario where K ≥ p. When K = p, provided none of the constraints are redundant, A will be invertible, so β = A^{-1}γ and the standard linear regression model from (3) can be expressed as

Y_i = β_0 + (A^{-1T} X_i)^T γ + ε_i,  i = 1, ..., n.  (6)

In this case the GVS problem can be reduced to a standard variable selection situation where we attempt to identify the non-zero elements of γ subject to the model given by (6).
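As a quick numerical illustration of this K = p reduction (a toy sketch of our own; the bidiagonal A below is an arbitrary invertible choice), the following verifies that the reparameterized model (6) has the same linear predictor as (3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
A = np.eye(p) - np.diag(np.ones(p - 1), -1)   # an invertible constraint matrix
gamma = A @ beta                               # gamma = A beta, eq. (5)
# X beta = (X A^{-1}) gamma, so (6) is the same model, reparameterized in gamma.
assert np.allclose(X @ beta, X @ np.linalg.inv(A) @ gamma)
```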

3 Functional Linear Regression

In this section we formulate the functional linear regression model as a GVS problem and demonstrate how to use a modified version of the "Dantzig Selector" to estimate β(t).

3.1 A GVS Functional Linear Regression Model

Our approach borrows ideas from the basis and penalization methods but is rather different from either. We start, in a similar vein to the basis approach, by selecting a p-dimensional basis B(t). However, instead of assuming B(t) provides a perfect fit for β(t), we allow for some error using the model

β(t) = B(t)^T η + e(t),  (7)

where e(t) represents the deviations of the true β(t) from our model. In addition, unlike the basis approach where p is chosen small to provide some form of regularization, we typically choose p ≫ n so |e(t)| can generally be assumed to be small. In Section 5 we show that the error rate in the estimate for β(t) can potentially be bounded by a function that only increases at the rate of √log(p), so there is little loss in accuracy from choosing a large value for p.

Combining (1) and (7) we arrive at

Y_i = β_0 + X_i^T η + ε*_i,  (8)

where X_i = ∫ W_i(t)B(t)dt and ε*_i = ε_i + ∫ W_i(t)e(t)dt. Clearly a method such as ordinary least squares cannot be applied to (8) because p > n. One could potentially estimate η using a variable selection procedure except that, for an arbitrary basis B(t), there is no reason to suppose that η will be sparse. In fact for many bases η will contain no zero elements. However, often we may believe that one or more derivatives of β(t) are sparse, i.e.

β^{(d)}(t) = 0  (9)

over large regions of t for one or more values of d = 0, 1, 2, .... For example, β^{(0)}(t) = 0 implies W(t) has no effect on Y at t, β^{(1)}(t) = 0 implies that β(t) is constant at t, β^{(2)}(t) = 0 implies that β(t) is linear at t, etc. The sparsity constraint given by (9) can be approximated by assuming

γ = Aη  (10)

is sparse for a certain A. In particular, we divide [0,1] into p evenly spaced points, t_1, t_2, ..., t_p, and let

A = [D^d B(t_1), D^d B(t_2), ..., D^d B(t_p)]^T

where D^d is the dth finite difference operator, i.e.

DB(t_j) = p[B(t_j) − B(t_{j−1})],  D²B(t_j) = p²[B(t_j) − 2B(t_{j−1}) + B(t_{j−2})],

etc. For example, one may believe that β^{(2)}(t) = 0 over many regions of t, i.e. β(t) is exactly linear over large regions of t. In this situation we would let

A = [D²B(t_1), D²B(t_2), ..., D²B(t_p)]^T.  (11)

Clearly (10) falls into the GVS paradigm. Provided p is large enough, estimating a sparse γ using (11) will not only enforce the constraints on β^{(2)}(t) but will also allow us to construct an accurate estimate for β(t).
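For the piecewise-constant basis used later in the paper, B(t_j) is the jth coordinate vector, so A reduces to a scaled finite-difference matrix. A minimal sketch of that construction (the function name, and the choice to pad the first d rows with lower-order differences in the spirit of the matrix (25) of Section 5, are ours):

```python
from math import comb
import numpy as np

def difference_matrix(p, d):
    """A = [D^d B(t_1), ..., D^d B(t_p)]^T for the piecewise-constant basis.
    Row j applies the d-th finite difference at t_j, scaled by p^d; the first
    d rows fall back to lower-order differences so that A is invertible."""
    A = np.zeros((p, p))
    for j in range(p):
        k = min(j, d)                          # difference order usable at row j
        for i in range(k + 1):
            A[j, j - i] = p**k * (-1)**i * comb(k, i)
    return A

# d = 2, p = 5 reproduces the pattern of the second-difference matrix (25).
print(difference_matrix(5, 2))
```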

If we choose a set of K = p linearly independent constraints, such as those given in (11), we may combine (8) and (10) to produce the FLRTI model

Y = Vγ + ε*  (12)

where V = [1|XA^{-1}], 1 is a vector of ones and β_0 has been incorporated into γ.

3.2 The Dantzig Selector

Fitting the FLRTI model given by (12) poses some difficulties because p > n, so standard least squares approaches will not suffice. However, γ is assumed sparse, which means a fitting procedure is potentially feasible. In the standard linear regression setting, the Dantzig Selector (Candes and Tao, 2006) was designed for exactly such a scenario, i.e. a model with large p but a sparse set of coefficients. It has demonstrated excellent empirical performance, is highly computationally efficient and has strong theoretical justifications.

Unlike most variable selection methods, which opt to minimize the sum of the squared errors subject to some form of penalty on the coefficients, the Dantzig Selector minimizes the L1 norm of the coefficients subject to constraints on the errors. In particular, consider the linear regression model Y = Xβ + ε. Then the Dantzig Selector estimate for β is given by minimizing

||β||_{ℓ1}  subject to  |X_j^T(Y − Xβ)| ≤ λ,  j = 1, ..., p,  (13)

where X_j is the jth column of X and λ is a tuning parameter. In this setup X_j is assumed to be norm one, but we show in the following section that this assumption is easily removed.

Candes and Tao (2006) provide theoretical justification for this approach. However, an intuitive motivation can be observed by noting that, for λ = 0, the constraint in (13) is equivalent to X^T Y = X^T Xβ, which is simply the least squares normal equation. Hence for p < n and λ = 0 the Dantzig Selector will produce the ordinary least squares solution. However, for λ > 0 the Dantzig Selector searches for the β with minimum L1 norm that is within a given distance of the maximum likelihood solution, i.e. the sparsest β that is still reasonably consistent with the observed data. Notice that even for p > n, where the likelihood equation will have infinite possible solutions, provided β is sparse this approach can still hope to identify the correct solution because it is only attempting to locate the sparsest β close to a peak of the likelihood function. One might imagine that minimizing the L0 norm, which counts the number of non-zero components of a vector, would be more appropriate than the L1 norm. However, directly minimizing the L0 norm is computationally difficult and one can show that, under suitable conditions, the L1 norm will also provide a sparse solution.

The Dantzig Selector is extremely computationally efficient. This is a result of the fact that (13) can be formulated as a linear programming problem

min ∑_{j=1}^p u_j  subject to  −u ≤ β ≤ u  and  −λ1 ≤ X^T(Y − Xβ) ≤ λ1,

where 1 is a vector of ones. As a consequence standard linear programming software can easily be used to fit the data. In addition, Candes and Tao (2006) prove tight non-asymptotic bounds on the error in the estimator for β. We discuss these bounds and extend them to the functional regression domain in Section 5.
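A minimal sketch of this linear program using scipy's HiGHS solver (the variable layout z = (u, β) is our own; the columns of X are assumed pre-scaled to norm one, as in (13)):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Solve min ||beta||_1 s.t. ||X^T (y - X beta)||_inf <= lam as an LP
    over z = (u, beta) with the auxiliary bounds -u <= beta <= u."""
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    I, Z = np.eye(p), np.zeros((p, p))
    c = np.concatenate([np.ones(p), np.zeros(p)])   # minimise sum_j u_j
    A_ub = np.vstack([
        np.hstack([-I,  I]),     #  beta - u <= 0
        np.hstack([-I, -I]),     # -beta - u <= 0
        np.hstack([ Z,  XtX]),   #  X^T X beta <= X^T y + lam
        np.hstack([ Z, -XtX]),   # -X^T X beta <= lam - X^T y
    ])
    b_ub = np.concatenate([np.zeros(2 * p), Xty + lam, lam - Xty])
    bounds = [(0, None)] * p + [(None, None)] * p    # u >= 0, beta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[p:]                                 # the beta block
```

Setting lam = 0 with p < n recovers the normal equations, matching the intuition above.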

3.3 Implementing The Fitting Procedure

The Dantzig Selector assumes norm one columns in the design matrix and was not envisioned for the generalized variable selection problem. However, in the case where K = p, we can use the transformed model given by (12) to adapt the Dantzig Selector to our functional regression problem. In particular, let D_v be a diagonal matrix with jth diagonal component equal to the norm of the jth column of V, i.e. ||V_1|| = √n and

||V_{j+1}|| = √( ∑_{i=1}^n ( ∫ W_i(t)B(t)^T dt A^{-1}_j )² )

where A^{-1}_j is the jth column of A^{-1}. Then for the model given by (12) the Dantzig Selector estimates γ by minimizing

||D_v γ||_{ℓ1}  subject to  ||D_v^{-1} V^T(Y − Vγ)||_{ℓ∞} ≤ λ.  (14)

Note that D_v adjusts the coefficients so that they are all measured on a comparable scale. After the coefficients, γ̂, have been obtained, we produce the FLRTI estimate for β(t) using

β̂(t) = B(t)^T η̂ = B(t)^T A^{-1} γ̂^{(−1)}  (15)

where γ̂^{(−1)} equals γ̂ after removing the estimate for β_0.
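Putting the pieces together, here is a sketch of the full K = p fitting procedure for the piecewise-constant basis (the function name, grid handling, and default tuning values are our own; it reuses difference_matrix and dantzig_selector from the sketches above):

```python
import numpy as np

def flrti_fit(W, y, p=100, lam=0.1, d=2):
    """Sketch of the K = p FLRTI fit (14)-(15) with a piecewise-constant basis.
    W: (n, m) curves on an even grid over [0, 1]; m assumed a multiple of p
    so that every region R_k contains samples.  Returns beta-hat on the p regions."""
    n, m = W.shape
    # X_ik = integral over R_k of W_i(t) dt: region average times width 1/p.
    idx = (np.arange(m) * p) // m
    X = np.column_stack([W[:, idx == k].mean(axis=1) / p for k in range(p)])
    Ainv = np.linalg.inv(difference_matrix(p, d))
    V = np.hstack([np.ones((n, 1)), X @ Ainv])        # V = [1 | X A^{-1}]
    Dv = np.linalg.norm(V, axis=0)                    # column norms, diag(D_v)
    gamma_scaled = dantzig_selector(V / Dv, y, lam)   # Dantzig on norm-one columns
    gamma = gamma_scaled / Dv                          # undo the scaling
    return Ainv @ gamma[1:]                            # beta-hat = A^{-1} gamma^(-1)
```

In practice λ would be chosen by cross-validation, as in the examples below.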

4 Controlling Multiple Derivatives

So far we have concentrated on controlling a single derivative of β(t). However, one of the most powerful aspects of the FLRTI approach is that we can combine constraints for multiple derivatives together to produce curves with many different properties. For example, one may believe that both β^{(0)}(t) = 0 and β^{(2)}(t) = 0 over many regions of t, i.e. β(t) is exactly zero over certain regions and β(t) is exactly linear over other regions of t. In this situation we would let

A = [D⁰B(t_1), D⁰B(t_2), ..., D⁰B(t_p), D²B(t_1), D²B(t_2), ..., D²B(t_p)]^T.  (16)

In general, such a matrix will result in K > p. Let A_{(1)} represent the first p rows of A and A_{(2)} the remainder. Similarly, let γ_{(1)} represent the first p elements of γ and γ_{(2)} the remaining elements. Then, assuming A is arranged so that A_{(1)} is invertible, (3) can be expressed as

Y_i = β_0 + (A_{(1)}^{-1T} X_i)^T γ_{(1)} + ε_i,  i = 1, ..., n.  (17)

Hence we wish to select the non-zero components of γ subject to the model given by (17) and γ_{(2)} = A_{(2)}A_{(1)}^{-1}γ_{(1)}. Note that for K > p the sparsity of γ is constrained because, except for the degenerate solution γ = 0, we can only guarantee at most p of the γ_k's will equal zero.

When constraining multiple derivatives one may well not wish to place equal weight on each derivative. For example, for the A given by (16), we may wish to place a greater emphasis on sparsity in the second derivative than in the zeroth, or vice versa. Hence we generalize (14) by computing γ̂ to be the quantity minimizing

||ΩD_v γ||_{ℓ1}  subject to  ||D_{v(1)}^{-1} V_{(1)}^T(Y − V_{(1)}γ_{(1)})||_{ℓ∞} ≤ λ  and  γ_{(2)} = A_{(2)}A_{(1)}^{-1}γ_{(1)},  (18)

where Ω is a diagonal weighting matrix. In theory a different weight could be chosen for each γ_j, but in practice this would not be feasible. Instead, for an A such as (16), we place a weight of one on the second derivatives and select a single weight, chosen via cross-validation, for the zeroth derivatives. This approach provides flexibility while still being computationally feasible and has worked well on all the problems we have examined.
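One way to solve (18) is to substitute the equality constraint so that γ_{(1)} is the only free block. The sketch below does this (our own reduction; intercept handling and the cross-validated choice of λ and Ω are omitted). Here Wpen stands for ΩD_v[I; A_{(2)}A_{(1)}^{-1}], so ||Wpen γ_{(1)}||_1 equals the objective in (18):

```python
import numpy as np
from scipy.optimize import linprog

def flrti_multi(V1, y, Wpen, lam):
    """Sketch of (18) after substituting gamma_(2) = A_(2) A_(1)^{-1} gamma_(1).
    Minimize ||Wpen g||_1 over g = gamma_(1), subject to the l-infinity
    constraint ||D_v1^{-1} V1^T (y - V1 g)||_inf <= lam."""
    K, p = Wpen.shape
    Dv1 = np.linalg.norm(V1, axis=0)
    G = (V1.T @ V1) / Dv1[:, None]          # D_v1^{-1} V1^T V1
    h = (V1.T @ y) / Dv1                    # D_v1^{-1} V1^T y
    c = np.concatenate([np.ones(K), np.zeros(p)])
    ZK = np.zeros((p, K))
    A_ub = np.vstack([
        np.hstack([-np.eye(K),  Wpen]),     #  Wpen g - u <= 0
        np.hstack([-np.eye(K), -Wpen]),     # -Wpen g - u <= 0
        np.hstack([ZK,  G]),                #  G g <= h + lam
        np.hstack([ZK, -G]),                # -G g <= lam - h
    ])
    b_ub = np.concatenate([np.zeros(2 * K), h + lam, lam - h])
    bounds = [(0, None)] * K + [(None, None)] * p
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[K:]                        # the gamma_(1) block
```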

We illustrate some results from this approach in Figures 2, 3, and 4. In all cases we used cross-validation to select both λ and the weight on the zeroth derivative. In each plot of Figure 2 the grey curve represents the true beta curve used to generate a simulated data set consisting of fifty pairs of responses Y_i and predictors W_i(t). Random Gaussian noise with σ = 2 was added to the response. The black lines are the corresponding FLRTI estimates. The estimates were all produced using a p = 100 dimensional piecewise constant basis where the kth component of B(t) equals 1 if t ∈ R_k = {t : (k−1)/p < t ≤ k/p} and zero otherwise. Figure 2a) contains the estimate obtained by assuming sparsity in the zeroth and first derivative. This results in the best approximating step function to the true curve. Notice that the sparsity assumption on the zeroth derivative ensures that the estimated β(t) is exactly zero over the region where the true β(t) is zero. Figure 2b) shows the corresponding FLRTI estimate generated by assuming sparsity in the zeroth and second derivatives and is almost a perfect fit. The estimate in Figure 2c) assumes sparsity in the zeroth and third derivative. Notice that the sparsity in the third derivative induces a smoother estimate. Finally, Figure 2d) shows the estimated and true beta curve, over the region from t = 0.3 to t = 0.7, resulting from only assuming sparsity in the second derivative. The overall fit is still good for the non-zero values of β(t) but the estimate is no longer exactly zero over the region where β(t) = 0.
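A simulation in this spirit can be reproduced with the sketches above. The paper does not state how the W_i(t) were generated, so the smooth random walks below are purely our own stand-in, as are the exact knots of the piecewise-linear beta curve:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p = 50, 200, 100
t = (np.arange(m) + 0.5) / m
# True coefficient curve: piecewise linear, exactly zero on roughly (0.25, 0.75),
# in the spirit of Figure 2 (the exact knots and heights are our own choice).
beta = np.where(t < 0.25, 30 * (0.25 - t) / 0.25, 0.0)
beta += np.where(t > 0.75, 30 * (t - 0.75) / 0.25, 0.0)
# Predictor curves: the paper does not specify their law; smooth random walks
# are used here purely as an illustrative stand-in.
W = np.cumsum(rng.normal(size=(n, m)), axis=1) / np.sqrt(m)
y = W @ beta / m + rng.normal(scale=2.0, size=n)   # Riemann sum of W_i * beta
beta_hat = flrti_fit(W, y, p=p, lam=0.5)           # sketch from Section 3.3
```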

Figure 3 illustrates a different β(t) produced using quadratic curves at the lower and higher time periods and β(t) = 0 in the middle time points. The response and predictors were otherwise generated in an identical fashion to the previous simulation with σ = 0.01. In Figure 3a) we have plotted the true β(t), in grey, and the estimate, produced by constraining the zeroth and third derivatives, in black. For this simulated data set

Figure 2: Plots of the true beta curve (grey) and corresponding FLRTI estimates (black). For each plot we constrained a) zeroth and first derivative, b) zeroth and second derivative, c) zeroth and third derivative, d) second derivative only. [Panels plot Time against Beta.]

the fit is almost perfect. In addition, by constraining the third derivative we have achieved a relatively smooth estimate. Figure 3b) illustrates the same plot concentrating on the region between 0.3 and 0.7. The dashed line is the best fitting b-spline estimate. Notice that, while the b-spline gives a reasonable approximation to the quadratic part of the curve, it provides a poor approximation for the region where β(t) = 0.

Finally, Figure 4 illustrates a simulation where one would not expect FLRTI to provide any advantage over a standard approach such as using a b-spline basis. Here the data was simulated as above (with σ = 0.5) but β(t) was chosen as a standard cubic curve. In this scenario there is no simple structure such as piecewise linear or exact zero regions for FLRTI to take advantage of, and one might expect the b-spline basis to provide superior results. However, Figure 4 shows that, even in this situation, the FLRTI method can give highly accurate estimates. In this example we constrained the zeroth and fourth derivatives with the tuning parameters chosen via cross-validation, but similar results were achieved when we constrained the third derivative. It would be a simple matter to use cross-validation to select the optimal derivative. This example illustrates the flexibility of FLRTI in that it can produce both simple linear estimates, that are easy to interpret, as well as more complicated non-linear structures with equal ease.

5 Theoretical Results

In this section we show that not only does the FLRTI approach empirically produce good estimates for β(t), but that, for any p by p invertible A, we can in fact prove tight, non-asymptotic bounds on the error in our estimate. In addition we derive asymptotic rates of convergence. Note that for notational convenience the results in this section assume β_0 = 0 and drop the intercept term from the model. However, the theory all extends in a straightforward manner to the situation with β_0 unknown.

Figure 3: a) True beta curve (grey) generated from two quadratic curves and a section with β(t) = 0. The FLRTI estimate from constraining the zeroth and third derivative is shown in black. b) Same plot for the region 0.3 ≤ t ≤ 0.7; the dashed line is the best b-spline fit.

5.1 Definitions

In order to prove our results we present two definitions first introduced in Candes and Tao (2005).

Definition 1 Let X be an n by p matrix with norm 1 columns and let X_T, T ⊂ {1, ..., p}, be the n by |T| submatrix obtained by extracting the columns of X corresponding to the indices in T. Then we define δ^X_S as the smallest quantity such that

(1 − δ^X_S)||c||²_{ℓ2} ≤ ||X_T c||²_{ℓ2} ≤ (1 + δ^X_S)||c||²_{ℓ2}

for all subsets T with |T| ≤ S and all vectors c of length |T|. Candes and Tao (2005) named δ_S the S-restricted isometry constant.

δ^X_S is essentially a measure of how close X is to orthogonal. If X is exactly orthogonal then ||X_T c||² = ||c||² for all T and c, and hence δ^X_S = 0. Alternatively, if two columns of X are identical then, for certain T and c, ||X_T c||² = 0 and hence δ^X_S = 1. We will show that in order to recover β it must be the case that δ^X_S < 1 and preferably δ^X_S is close to zero.

Definition 2 Let T and T′ be two disjoint sets with T, T′ ⊂ {1, ..., p}, |T| ≤ S and |T′| ≤ S′. Then, provided S + S′ ≤ p, we define θ^X_{S,S′} as the smallest quantity such that

|(X_T c)^T X_{T′} c′| / (||c||_{ℓ2} ||c′||_{ℓ2}) ≤ θ^X_{S,S′}  (19)

Figure 4: a) True beta curve (grey) generated from a cubic curve. The FLRTI estimate from constraining the zeroth and fourth derivative is represented by the solid black line and the b-spline estimate is the dashed line. b) Estimation errors using the b-spline (dashed) and FLRTI (solid).

for all T and T′ and all corresponding vectors c and c′. Candes and Tao (2005) named θ^X_{S,S′} the S,S′-restricted orthogonality constant.

Notice that the left hand side of (19) is equal to the cosine of the angle between X_T c and X_{T′} c′, so θ^X_{S,S′} is also a measure of the orthogonality of X. If X is exactly orthogonal then X_T c and X_{T′} c′ will be perpendicular and hence θ^X_{S,S′} = 0. Alternatively, if X_T and X_{T′} are linear combinations of each other, then the angle between X_T c and X_{T′} c′ will be zero for some c and c′ and hence θ^X_{S,S′} = 1.

5.2 A Non-Asymptotic Bound On The Error

For a standard linear model where the coefficient vector is S-sparse, Candes and Tao (2006) prove a tight bound on the L2 norm of the error between the Dantzig Selector estimates and the true coefficients, provided δ^X_{2S} + θ^X_{S,2S} < 1. Theorem 1 extends this result to the somewhat more complicated functional linear regression setting.

Theorem 1 For a given p-dimensional basis B_p(t), let ω_p = sup_t |e_p(t)| and γ_p = Aη_p where A is a p by p matrix. Let Ṽ = VD_v^{-1} where V is the design matrix from (12). Suppose that γ_p has at most S_p non-zero components and δ^Ṽ_{2S_p} + θ^Ṽ_{S_p,2S_p} < 1. Further suppose that we estimate β(t) using the FLRTI estimate given by (15) using any value of λ such that

max |Ṽ^T ε*| ≤ λ.  (20)

Then, for every 0 ≤ t ≤ 1,

|β̂(t) − β(t)| ≤ (1/√n) C_{n,p}(t) λ √S_p + ω_p  (21)

where

C_{n,p}(t) = 4α_{n,p}(t) / (1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S})

and

α_{n,p}(t) = √( ∑_{j=1}^p (B_p(t)^T A^{-1}_j)² / [ (1/n) ∑_{i=1}^n ( ∫ W_i(s)B_p(s)^T A^{-1}_j ds )² ] ).

Theorem 1 suggests that our optimal choice for λ would be the lowest value such that (20) holds. Unfortunately, this is not feasible because the ε*_i's are unobserved random variables. However, Corollary 1 shows that we can choose λ such that (20) holds with high probability.

Corollary 1 Suppose that ε_i ∼ N(0, σ_1²) and that there exists an M < ∞ such that ∫|W_i(t)|dt ≤ M for all i. Then for any φ ≥ 0, if λ = σ_1√(2(1+φ)log p) + Mω_p√n then (20) will hold with probability at least 1 − ( p^φ √(4π(1+φ)log p) )^{-1}, and hence

|β̂(t) − β(t)| ≤ (1/√n) C_{n,p}(t) σ_1 √(2S_p(1+φ)log p) + ω_p ( 1 + C_{n,p}(t)·M√S_p ).  (22)

In addition, if we assume ε*_i ∼ N(0, σ_2²) then (20) will hold with the same probability for λ = σ_2√(2(1+φ)log p), in which case

|β̂(t) − β(t)| ≤ (1/√n) C_{n,p}(t) σ_2 √(2S_p(1+φ)log p) + ω_p.  (23)

Note that (22) and (23) are non-asymptotic results that hold, with high probability, for any n or p. The properties of both (22) and (23) as a function of n or p depend on the behavior of α_{n,p}(t). If the W_i(s) are chosen as iid functions then, for fixed p, by the law of large numbers, α_{n,p}(t) will converge to a constant as n → ∞, provided E(∫W_i(s)B_p(s)^T A^{-1}_j ds)² > 0 for all j. It is then easy to see that the first terms of (22) and (23) will converge to zero at the rate of n^{-0.5} while the second terms, which represent the approximation error, remain fixed as a function of n.

However, the behavior of α_{n,p}(t) as a function of p depends heavily on the value of t and the choice of B_p(t) and A. Consider, for example, the piecewise constant basis that we used in Section 4, i.e. where the kth component of B(t) equals 1 if t ∈ R_k = {t : (k−1)/p < t ≤ k/p} and zero otherwise. Then, for any given t ∈ R_k, it is a simple matter to construct A such that B_p(t)^T A^{-1}_j = A^{-1}_{jk} = 0 for all j ≠ k, in which case

α_{n,p}(t) = [ (1/n) ∑_{i=1}^n ( (1/p) ∑_{l=1}^p W̄_il A^{-1}_{lk}/A^{-1}_{kk} )² ]^{-1/2}  (24)

where A^{-1}_{lk} is the l,kth element of A^{-1} and W̄_il is the average of W_i(t) in R_l, i.e. p∫_{R_l} W_i(s)ds. It is straightforward to construct a matrix A such that (24) is bounded in p for a particular value of t. Consider, for example, the second difference matrix,

A = p² [ 1/p²   0     0    0   ...  0   0   0
         −1/p   1/p   0    0   ...  0   0   0
          1    −2     1    0   ...  0   0   0
          0     1    −2    1   ...  0   0   0
          ⋮     ⋮     ⋮    ⋮    ⋱   ⋮   ⋮   ⋮
          0     0     0    0   ...  1  −2   1 ].  (25)

Then it is easily shown that the first column of A^{-1} is constant (see (31)) and hence A^{-1}_{l1}/A^{-1}_{11} = 1. Thus, for fixed n, α_{n,p}(0) will converge to ( (1/n) ∑_{i=1}^n (∫W_i(s)ds)² )^{-1/2} as p → ∞. Note that the choice of t = 0 is arbitrary. We could alternatively construct A such that α_{n,p}(t) converges in p for any finite set of time points. In this case the first terms of (22) and (23) grow at only a √log p rate while the second terms will generally shrink as ω_p declines with p. For example, using the piecewise constant basis it is easy to show that ω_p converges to zero at a rate of 1/p provided β′(t) is bounded. Alternatively, using a piecewise polynomial basis of order d, ω_p converges to zero at a rate of 1/p^{d+1} provided β^{(d+1)}(t) is bounded. However, we are not aware of any way to construct A such that α_{n,p}(t) is simultaneously bounded in p for all t. Hence for some values of t it will be the case that α_{n,p}(t) grows at the rate of √p.

5.3 Asymptotic Rates Of Convergence

The bounds presented in Theorem 1 can be used to derive asymptotic rates of convergence for β̂(t) as n and p grow. The exact convergence rates are somewhat dependent on the choice of B_p(t) and A, so we first state A-1 through A-6, which give general conditions for convergence. We show in Theorem 2 that these conditions are sufficient to guarantee convergence for any choice of B_p(t) and A, and then Corollary 2 provides specific examples where the conditions can be shown to hold.

A-1 There exists S < ∞ such that S_p ≤ S for all p.

A-2 There exists m > 0 such that p^m ω_p is bounded, i.e. ω_p ≤ H/p^m for some H < ∞.

A-3 For a given value of t, there exists b_t such that p^{−b_t} α_{n,p}(t) is bounded for all n and p.

A-4 There exists c such that p^{−c} sup_t α_{n,p}(t) is bounded for all n and p.

A-5 There exists a p* such that δ^{Ṽ_{n,p*}}_{2S} + θ^{Ṽ_{n,p*}}_{S,2S} is bounded away from one for large enough n.

A-6 δ^{Ṽ_{n,p_n}}_{2S} + θ^{Ṽ_{n,p_n}}_{S,2S} is bounded away from one for large enough n, where n → ∞, p_n → ∞ and p_n/n → 0.

A-1 states that A must be chosen such that the sparsity of the γ's remains bounded as p grows. Sparsity in the predictors is crucial to the success of the Dantzig Selector. This condition will be shown to hold provided the appropriate derivatives of β(t) are non-zero at only a finite set of points. A-2 assumes that the bias in our estimate for β(t) converges to zero at the rate of p^m, for some m > 0, as we increase the dimension of the basis function, B_p(t). As discussed earlier, this will hold for, among other bases, the polynomial spline basis provided the appropriate derivative of β(t) is bounded. A-3 requires that α_{n,p_n}(t) grows no faster than p^{b_t}. This point has already been discussed at length. A-4 is simply a stronger form of A-3 and requires that sup_t α_{n,p_n}(t) grows no faster than p^c for some c > 0. A-5 and A-6 both ensure that the design matrix is close enough to orthogonal for (22) to hold and hence impose a form of identifiability on β(t). A-5 is a fairly weak assumption because it only holds for some finite p and large n. A-6 is a considerably stronger assumption because it holds as p approaches infinity. Hence we present asymptotic results under both the weaker and stronger assumptions. Theorem 2 shows that, for any B_p(t), A and W_i(t), under conditions A-1 through A-6, the FLRTI estimate for β(t) will be highly accurate for large n. We present four asymptotic results ordered from the weakest to the strongest assumptions.

Theorem 2 Suppose the conditions in Theorem 1 hold and ε_i ∼ N(0, σ_1²). Then as n → ∞,

1. Suppose A-1 through A-5 all hold and we fix p = p*. Then, with arbitrarily high probability,

|β̂_n(t) − β(t)| ≤ O(n^{-1/2}) + E_n(t)  and  sup_t |β̂_n(t) − β(t)| ≤ O(n^{-1/2}) + sup_t E_n(t),

where E_n(t) = (H/p*^m)( 1 + C_{n,p*}(t)M√S ).

2. Suppose A-1 through A-5 all hold and ε*_i ∼ N(0, σ_2²). Then, with arbitrarily high probability,

|β̂_n(t) − β(t)| ≤ O(n^{-1/2}) + H/p*^m  and  sup_t |β̂_n(t) − β(t)| ≤ O(n^{-1/2}) + H/p*^m.

3. Suppose we replace A-5 with A-6 but do not assume ε*_i ∼ N(0, σ_2²). Then, provided b_t and c are less than m, if we let p grow at the rate of n^{1/(2m)},

|β̂_n(t) − β(t)| = O( √(log n) / n^{1/2 − b_t/(2m)} )  and  sup_t |β̂(t) − β(t)| = O( √(log n) / n^{1/2 − c/(2m)} ).

4. Suppose we assume A-6 as well as ε*_i ∼ N(0, σ_2²). Then if we let p grow at the rate of n^{1/(2m+2b_t)} the rate of convergence improves to

|β̂_n(t) − β(t)| = O( √(log n) / n^{(m/(m+b_t))/2} ),

or if we let p grow at the rate of n^{1/(2m+2c)} the supremum converges at a rate of

sup_t |β̂(t) − β(t)| = O( √(log n) / n^{(m/(m+c))/2} ).

There are obviously many possible choices of basis function and A matrix. Corollary 2 below provides a specific example where conditions A-1 to A-4 of Theorem 2 can be shown to hold.

Corollary 2 Suppose we divide the time interval [0,1] into p equal regions and use the piecewise constant basis. Let A be the second difference matrix given by (25). Suppose that W_i(t) is bounded above zero for all i and t. Then, provided β′(t) is bounded and β″(t) ≠ 0 at a finite number of points, A-1, A-2 and A-3 all hold with m = 1, b_0 = 0 and b_t = 0.5 for 0 < t < 1. In addition, for t bounded away from one, A-4 will also hold with c = 0.5. Hence, if A-5 holds and ε*_i ∼ N(0, σ_2²),

|β̂_n(t) − β(t)| ≤ O(n^{-1/2}) + H/p*  and  sup_t |β̂_n(t) − β(t)| ≤ O(n^{-1/2}) + H/p*.

Alternatively, if A-6 holds and ε*_i ∼ N(0, σ_2²),

|β̂_n(t) − β(t)| = O( √(log n)/n^{1/2} ) for t = 0,  |β̂_n(t) − β(t)| = O( √(log n)/n^{1/3} ) for 0 < t < 1,

and

sup_{0<t<1−a} |β̂_n(t) − β(t)| = O( √(log n)/n^{1/3} )

for any a > 0.

Figure 5: a) Smoothed daily temperature curves for 9 of 35 Canadian weather stations. b) Estimated beta curve using a natural cubic spline. c) Estimated beta curve using the FLRTI approach (black) with the cubic spline estimate (grey). [Panels are indexed by month, January through December.]

Note that the choice of t = 0 for the faster rate of convergence is simply made for notational convenience. By appropriately choosing A we can achieve this rate for any fixed value of t, or indeed for any finite set of time points. In addition, the choice of the piecewise constant basis was made for simplicity. Similar results can be derived for higher order polynomial bases, in which case A-2 will hold with a higher m and hence faster rates of convergence will be possible.

6 Canadian Weather Data

In this section we demonstrate the FLRTI approach on a classic functional linear regression data set. The data consists of one year of daily temperature measurements from each of 35 Canadian weather stations. Figure 5a) illustrates the curves for 9 randomly selected stations. We also observe the annual rainfall, on the log scale, at each weather station. The aim is to use the temperature curves to predict annual rainfall at each location. In particular we were interested in identifying the times of the year that have an effect on rainfall. Previous research suggested that temperatures in the summer months may have little or no relationship to rainfall whereas temperatures at other times do have an effect. Figure 5b) provides an estimate for β(t) achieved using a standard approach where a q-dimensional natural cubic spline is used to represent both the predictors and β(t). The coefficient curve is then estimated using standard least squares on the q-dimensional coefficients for the basis function. In this case we also restricted the values at the start and the end of the year to be equal. A value of q = 4 was chosen using cross-validation. The curve suggests a positive relationship between temperature and rainfall in the fall months and a negative relationship in the spring. There also appears to be little relationship during the summer months. However, because of the restricted functional form of the curve, there are only two points where β(t) = 0.

Figure 6: a) and b) Estimated beta curves from constraining the first and third derivatives respectively. The dashed lines represent 95% confidence intervals. c) R² from permuting the response variable 500 times. The grey line represents the observed R² from the true data.

The corresponding estimate from the FLRTI approach, with the zeroth and second derivatives restricted, is presented in Figure 5c) (black line) with the spline estimate in grey. The choice of λ, as well as the weight on the zeroth derivative for Ω, were made using ten fold cross-validation. The FLRTI estimate also indicates a negative relationship in the spring and a positive relationship in the late fall. However, because of the far more flexible form of the curve, we are able to produce an estimate of exactly zero relationship over the summer and winter months. This produces a much more easily interpretable picture. In addition, there is strong evidence that the FLRTI estimates are more accurate than those for the spline method, with 35 fold cross-validated sums of squared errors of 3.95 vs 7.33 respectively.

In addition to estimates for β(t), one can also easily generate confidence intervals and tests of significance. We illustrate these ideas in Figure 6. Pointwise confidence intervals on β(t) can be produced by bootstrapping the pairs of observations {Y_i, W_i(t)}, reestimating β(t) and then taking the appropriate empirical quantiles from the estimated curves at each time point. Figures 6a) and b) illustrate the estimates from restricting the first and third derivatives respectively, along with the corresponding 95% confidence intervals. Both figures closely resemble the estimate from Figure 5c). Restricting the first derivative produces a stepwise constant function while restricting the third derivative gives a smoother and hence more visually appealing, though slightly less interpretable, estimate for β(t). The confidence intervals confirm the statistical significance of the positive relationship in the fall months. The significance of the negative relationship in the spring months is less clear since the upper bound is at zero. However, this is somewhat misleading because approximately 96% of the bootstrap curves did include a dip in the spring but, because the dips occurred at slightly different times, their effect canceled out to some extent. Some form of curve registration may be appropriate but we will not explore that here. Note that the bootstrap estimates also consistently estimate zero relationship during the summer months, providing further evidence that there is little effect from temperature in this period. Finally, Figure 6c) illustrates a permutation test we developed for testing statistical significance of the relationship between temperature and rainfall. The grey line indicates the value of R² (0.73) for the FLRTI method applied to the weather data. We then permuted the response variable 500 times and for each permutation computed the new R². All 500 permuted R²'s were well below 0.73, providing very strong evidence of a true relationship.
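Both procedures are generic given a fitting routine. The sketch below is our own: `fit` and `r_squared` stand for any user-supplied callables, such as the flrti_fit sketch above and an R² computed from its predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_bands(W, y, fit, n_boot=200, level=0.95):
    """Pointwise bootstrap intervals for beta-hat by resampling (Y_i, W_i) pairs.
    `fit` maps (W, y) to an estimated beta curve on a fixed grid."""
    n = len(y)
    curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample pairs with replacement
        curves.append(fit(W[idx], y[idx]))
    lo, hi = np.quantile(curves, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lo, hi

def permutation_test(W, y, r_squared, n_perm=500):
    """Permutation p-value: permute the responses and compare the observed
    R^2 with the permutation distribution."""
    observed = r_squared(W, y)
    perm = [r_squared(W, rng.permutation(y)) for _ in range(n_perm)]
    return observed, np.mean(np.array(perm) >= observed)
```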

7 Discussion

The approach presented in this paper takes a departure from the standard regression paradigm where one generally attempts to minimize an L2 quantity, such as the sum of squared errors, subject to an additional penalty term. Instead we attempt to find the sparsest solution, in terms of various derivatives of β(t), subject to the solution providing a reasonable fit to the data. By directly searching for sparse solutions we are able to produce estimates that have far simpler structure than that from traditional methods while still maintaining the flexibility to produce more complicated coefficient curves when required. The theoretical bounds derived in Section 5, which show the error rate can grow as slowly as √log p, as well as the empirical results, suggest that one can choose an extremely flexible basis, in terms of a large value for p, without sacrificing prediction accuracy.

There has been some previous work along these lines. For example, Tibshirani et al. (2005) use an L1 lasso type penalty on both the zeroth and first derivatives of a set of coefficients to produce an estimate which is both exactly zero over some regions and exactly constant over other regions. Valdes-Sosa et al. (2005) also use a combination of both L1 and L2 penalties on fMRI data. Probably the work closest to ours is a recent approach by Lu and Zhang (2006) called the "functional smooth lasso" (FSL). The FSL uses a lasso type approach by placing an L1 penalty on the zeroth derivative and an L2 penalty on the second derivative. This is a nice approach and, as with FLRTI, produces regions where β(t) is exactly zero. However, our approach can be differentiated from these other methods in that we consider derivatives of different orders, so FLRTI can generate piecewise constant, linear, and quadratic sections. In addition, FLRTI possesses interesting theoretical properties in terms of the non-asymptotic bounds.

An obvious area for future work would involve generalizing the linear functional model we have used. For example, we believe that the linear programming algorithm, as well as the theoretical results, from the Dantzig Selector could be extended to the case of functional generalized linear models where the response distribution could be made more general. Such an extension would allow one to use our method on classification type problems with categorical responses, which are common in FDA settings.

A Proof of Theorem 1

First we state one of the key results from Candes and Tao (2006).

Theorem 3 (Candes and Tao (2006)) Let Y = Xγ + ε where X has norm one columns. Suppose that γ is an S-sparse vector with δ^X_{2S} + θ^X_{S,2S} < 1. Let γ̂ be the corresponding solution from the Dantzig Selector. Then

||γ̂ − γ|| ≤ 4λ√S / (1 − δ^X_{2S} − θ^X_{S,2S})

provided that max|X^T ε| ≤ λ.

This result is proved in Section 3.2 of Candes and Tao (2006).

First note that the functional linear regression model given by (12) can be reexpressed as

Y = Vγ + ε* = Ṽγ̃ + ε*,  (26)

where Ṽ = VD_v^{-1} is the standardized version of V with columns having norm one and γ̃ = D_v γ. Hence, by Theorem 3,

||D_v γ̂ − D_v γ|| ≤ 4λ√S / (1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S})

provided (20) holds.

But β̂(t) = B_p(t)^T A^{-1} γ̂ = B_p(t)^T A^{-1} D_v^{-1} (D_v γ̂) while β(t) = B_p(t)^T η + e_p(t) = B_p(t)^T A^{-1} D_v^{-1} (D_v γ) + e_p(t). Then

|β̂(t) − β(t)| ≤ |β̂(t) − B_p(t)^T η| + |e_p(t)|
= ||B_p(t)^T A^{-1} D_v^{-1} (D_v γ̂ − D_v γ)|| + |e_p(t)|
≤ ||B_p(t)^T A^{-1} D_v^{-1}|| · ||D_v γ̂ − D_v γ|| + |e_p(t)|
≤ ||B_p(t)^T A^{-1} D_v^{-1}|| · ||D_v γ̂ − D_v γ|| + ω_p
= (1/√n) α_{n,p}(t) ||D_v γ̂ − D_v γ|| + ω_p
≤ (1/√n) · 4α_{n,p}(t)λ√S_p / (1 − δ^Ṽ_{2S_p} − θ^Ṽ_{S_p,2S_p}) + ω_p.

B Proof of Corollary 1

Substituting λ = σ_1√(2(1+φ)log p) + Mω√n into (21) gives (22). Let ε′_i = ∫W_i(t)e_p(t)dt. Then, to show that (20) holds with the appropriate probability, note that

|Ṽ_j^T ε*| = |Ṽ_j^T ε + Ṽ_j^T ε′| ≤ |Ṽ_j^T ε| + |Ṽ_j^T ε′| = σ_1|Z_j| + |Ṽ_j^T ε′| ≤ σ_1|Z_j| + Mω√n,

where Z_j ∼ N(0,1). The third step follows from the fact that Ṽ_j is norm one and, since ε_i ∼ N(0, σ_1²), it will be the case that Ṽ_j^T ε ∼ N(0, σ_1²). Hence

P( max_j |Ṽ_j^T ε*| > λ ) = P( max_j |Ṽ_j^T ε*| > σ_1√(2(1+φ)log p) + Mω√n )
≤ P( max_j |Z_j| > √(2(1+φ)log p) )
≤ p (1/√(2π)) exp{−(1+φ)log p} / √(2(1+φ)log p)
= ( p^φ √(4(1+φ)π log p) )^{-1}.

The penultimate line follows from the fact that P(sup_j |Z_j| > u) ≤ (p/u)(1/√(2π)) exp(−u²/2).

In the case where ε*_i ∼ N(0, σ_2²), substituting λ = σ_2√(2(1+φ)log p) into (21) gives (23). In this case Ṽ_j^T ε* = σ_2 Z_j where Z_j ∼ N(0,1). Hence

P( max_j |Ṽ_j^T ε*| > σ_2√(2(1+φ)log p) ) = P( max_j |Z_j| > √(2(1+φ)log p) )

and the result follows in the same manner as above.

C Proof of Theorem 2

Part 1

By Corollary 1, for p = p*,

|β̂(t) − β(t)| ≤ (1/√n) C_{n,p*}(t) σ_1 √(2S_{p*}(1+φ)log p*) + ω_{p*}( 1 + C_{n,p*}(t)·M√S_{p*} )

with arbitrarily high probability provided φ is large enough. But, by A-1, S_{p*} is bounded, and, by A-3 and A-5, C_{n,p*}(t) is bounded for large n. Hence, since p*, σ_1 and φ are fixed, the first term on the right hand side is O(n^{-1/2}). Finally, by A-2, ω_{p*} ≤ H/p*^m and, by A-1, S_{p*} ≤ S, so the second term of the equation is at most E_n(t). The result for sup_t |β̂(t) − β(t)| can be proved in an identical fashion by replacing A-3 by A-4.

Part 2

By Corollary 1, if we assume ε*_i ∼ N(0, σ_2²), then, for p = p*,

|β̂(t) − β(t)| ≤ (1/√n) C_{n,p*}(t) σ_2 √(2S_{p*}(1+φ)log p*) + ω_{p*}

with arbitrarily high probability provided φ is large enough. Then we can show that the first term is O(n^{-1/2}) in exactly the same fashion as for Part 1. Also, by A-2, the second term is bounded by H/p*^m.


Part 3

By A-1 there exists S < ∞ such that S_p < S for all p. Hence, by (22), setting φ = 0, with probability converging to one as p → ∞,

|β̂(t) − β(t)| ≤ (1/√n) ( 4α_{n,p_n}(t)/(1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S}) ) σ_1 √(2S log p_n) + ω_{p_n}( 1 + ( 4α_{n,p_n}(t)/(1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S}) ) M√S )  (27)

= ( √(log n) / n^{1/2 − b_t/(2m)} ) · const  (28)

where

const = (p_n/n^{1/(2m)})^{b_t} √(log p_n/log n) ( 4p_n^{−b_t} α_{n,p_n}(t)/(1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S}) ) σ_1√(2S)
  + (p_n/n^{1/(2m)})^{b_t−m} ( ω_{p_n} p_n^m/√(log n) ) ( p_n^{−b_t} + ( 4p_n^{−b_t} α_{n,p_n}(t)/(1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S}) ) M√S ).  (29)

But if we let p_n = O(n^{1/(2m)}) then (29) is bounded because

• p_n/n^{1/(2m)} and log p_n/log n are bounded by construction of p_n;
• ω_{p_n} p_n^m is bounded by A-2;
• p_n^{−b_t} α_{n,p_n} is bounded by A-3;
• (1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S})^{-1} is bounded by A-6.

Hence |β̂_n(t) − β(t)| = O( √(log n)/n^{1/2 − b_t/(2m)} ). With the addition of A-4 exactly the same argument can be used to prove sup_t |β̂(t) − β(t)| = O( √(log n)/n^{1/2 − c/(2m)} ).

Part 4

If ε*_i ∼ N(0, σ_2²) then, by (23), setting φ = 0, with probability converging to one as p → ∞,

|β̂(t) − β(t)| ≤ ( √(log n) / n^{(m/(m+b_t))/2} ) · const

where

const = (p_n/n^{1/(2(m+b_t))})^{b_t} √(log p_n/log n) ( 4p_n^{−b_t} α_{n,p_n}(t)/(1 − δ^Ṽ_{2S} − θ^Ṽ_{S,2S}) ) σ_2√(2S) + (p_n/n^{1/(2(m+b_t))})^{−m} p_n^m ω_p/√(log n).  (30)

Hence if p_n = O(n^{1/(2(m+b_t))}) then (30) is bounded, using the same arguments as with (29), so |β̂_n(t) − β(t)| = O( √(log n)/n^{(m/(m+b_t))/2} ). We can prove that sup_t |β̂_n(t) − β(t)| = O( √(log n)/n^{(m/(m+c))/2} ) in the same way.


D Proof of Corollary 2

Throughout this proof let η_k = β(k/p). First we show A-1 holds. Suppose that β″(t) = 0 for all t in R_{k−2}, R_{k−1} and R_k. Then there exist b_0 and b_1 such that β(t) = b_0 + b_1 t over this region. Hence, for k ≥ 2,

γ_k = p²( η_{k−2} − 2η_{k−1} + η_k )
= p²( β((k−2)/p) − 2β((k−1)/p) + β(k/p) )
= p²( b_0 + b_1(k−2)/p − 2b_0 − 2b_1(k−1)/p + b_0 + b_1 k/p ) = 0.

But note that if β″(t) ≠ 0 at no more than S values of t then there are at most 3S triples such that β″(t) ≠ 0 for some t in R_{k−2}, R_{k−1} and R_k. Hence there can be no more than 3S + 2 γ_k's that are not equal to zero (where the two comes from γ_1 and γ_2).

Next we show A-2 holds. For any t ∈ R_k, B(t)^T η = η_k. But since |β′(t)| < G for some G < ∞ and R_k is of length 1/p, it must be the case that sup_{t∈R_k} β(t) − inf_{t∈R_k} β(t) ≤ G/p. Let η_k be any value between sup_{t∈R_k} β(t) and inf_{t∈R_k} β(t) for k = 1, ..., p. Then

ω_p = sup_t |β(t) − B(t)^T η| = max_k sup_{t∈R_k} |β(t) − η_k| ≤ max_k ( sup_{t∈R_k} β(t) − inf_{t∈R_k} β(t) ) ≤ G/p,

so A-2 holds with m = 1.

Now we show A-3 holds. For t ∈ R_k let

L_{nj}(t) = (1/n) ∑_{i=1}^n ( (1/p) ∑_{l=1}^p W̄_il A^{-1}_{lj}/A^{-1}_{kj} )²,

where A^{-1}_{lk} is the l,kth element of A^{-1} and W̄_il is the average of W_i(t) in R_l, i.e. p∫_{R_l} W_i(s)ds. Then

α_{n,p}(t) = √( ∑_{j=1}^p L_{nj}(t)^{-1} ).

Since W_i(t) is bounded above zero, L_{nj}(t) ≥ W_ε² ( (1/p) ∑_{l=1}^p A^{-1}_{lj}/A^{-1}_{kj} )² for some W_ε > 0. It is easily verified that

A^{-1} = (1/p²) [ p²   0        0    0   ...  0
                  p²   p        0    0   ...  0
                  p²   2p       1    0   ...  0
                  p²   3p       2    1   ...  0
                  ⋮    ⋮        ⋮    ⋮    ⋱   0
                  p²   p(p−1)  p−2  p−3  ...  1 ]  (31)

and hence

A^{-1}_{lj}/A^{-1}_{kj} = ∞ if k < j and l ≥ j;  0 if k ≥ j and l < j;  (l−j+1)/(k−j+1) if k ≥ j and l ≥ j,

except for j = 1, in which case the ratio equals one for all l and k. For t = 0 we have t ∈ R_1 (i.e. k = 1) and hence L_{n1}(0) ≥ W_ε² while L_{nj}(0) = ∞ for all j > 1. Therefore α_{n,p}(0) ≤ W_ε^{-1} for all p. Hence A-3 holds with b_0 = 0. Alternatively, for 0 < t < 1, k = ⌊pt⌋ and L_{nj}(t) = ∞ for j > k. Hence for j ≤ k,

L_{nj}(t) ≥ W_ε² (1/p²) ( ∑_{l=j}^p (l−j+1)/(k−j+1) )²
= W_ε² (1/p²) ( (p−j+1)(p−j+2)/(2(k−j+1)) )²
≥ W_ε² (1/p²) ( p⁴(1−t)²/4 ) ( 1/(k−j+1) )²,  since j ≤ pt.

Therefore

α_{n,p}(t) ≤ (2/((1−t)W_ε p)) √( ∑_{j=1}^k (k−j+1)² ) = (2/((1−t)W_ε p)) √( k(k+1)(2k+1)/6 ) ≤ p^{1/2} · 2t^{3/2}/((1−t)W_ε)  (32)

since k = ⌊pt⌋. Hence A-3 holds with b_t = 1/2 for 0 < t < 1.

Finally, note that (32) holds for any t < 1 and is increasing in t, so

sup_{0<t<1−a} α_{n,p}(t) ≤ p^{1/2} · 2/(aW_ε)

for any a > 0, and hence A-4 holds with c = 1/2.

E Alternative Implementations of GVS

For the functional linear regression problem we have concentrated on the GVS problem with K ≥ p. Here we briefly consider the situation for K < p. Let A_{(1)} represent the first K columns of A and A_{(2)} the remainder. Similarly, let β_{(1)} and X_{(1)i} represent the first K elements of β and X_i respectively, and let β_{(2)} and X_{(2)i} represent the remaining elements. Then we can assume, without loss of generality, that the constraints in A are ordered in such a way that A_{(1)} is invertible. In this case it is easily shown that (3) can be expressed as

Y_i = α + X_{γi}^T γ + X_{βi}^T β_{(2)} + ε_i,  i = 1, ..., n,  (33)

where X_{γi} = A_{(1)}^{-1T} X_{(1)i} and X_{βi} = X_{(2)i} − A_{(2)}^T A_{(1)}^{-1T} X_{(1)i}. Hence the GVS problem reduces to one of selecting the non-zero components of γ subject to the model given by (33). Note that this will only be feasible if n > p − K because there is no assumption of sparsity in β.

References

Breiman, L. (1995). Better subset regression using the non-negative garrote. Technometrics 37, 373–384.

Candes, E. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51, 4203–4215.

Candes, E. and Tao, T. (2006). The Dantzig selector: statistical estimation when p is much larger than n. Working Paper.

Chen, S., Donoho, D., and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM Journal of Scientific Computing 20, 33–61.

Fahrmeir, L. and Tutz, G. (1994). Multivariate Statistical Modeling Based on Generalized Linear Models. Springer-Verlag.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.

Fan, J. and Zhang, J. (2000). Two-step estimation of functional linear models with applications to longitudinal data. Journal of the Royal Statistical Society, Series B 62, 303–322.

Faraway, J. (1997). Regression analysis for a functional response. Technometrics 39, 254–261.

Ferraty, F. and Vieu, P. (2002). The functional nonparametric model and applications to spectrometric data. Computational Statistics 17, 545–564.

Ferraty, F. and Vieu, P. (2003). Curves discrimination: a nonparametric functional approach. Computational Statistics and Data Analysis 44, 161–173.

Hastie, T. and Mallows, C. (1993). Comment on "a statistical view of some chemometrics regression tools". Technometrics 35, 140–143.

Hoover, D. R., Rice, J. A., Wu, C. O., and Yang, L. P. (1998). Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika 85, 809–822.

James, G. M. (2002). Generalized linear models with functional predictors. Journal of the Royal Statistical Society, Series B 64, 411–432.

James, G. M. and Hastie, T. J. (2001). Functional linear discriminant analysis for irregularly sampled curves. Journal of the Royal Statistical Society, Series B 63, 533–550.

James, G. M. and Silverman, B. W. (2005). Functional adaptive model estimation. Journal of the American Statistical Association 100, 565–576.

Liang, K. Y. and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.

Lin, D. Y. and Ying, Z. (2001). Semiparametric and nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association 96, 103–113.

Lu, Y. and Zhang, C. (2006). Spatially adaptive functional linear regression with functional smooth lasso. Under Review.

Mallat, S. and Zhang, Z. (1993). Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing 41, 3397–3415.

Muller, H. G. and Stadtmuller, U. (2005). Generalized functional linear models. Annals of Statistics 33, 774–805.

Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis. Springer.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer, 2nd edn.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.

Tibshirani, R., Saunders, M., Rosset, S., and Zhu, J. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B 67, 91–108.

Valdes-Sosa, P. A., Sanchez-Bornot, J. M., Lage-Castellanos, A., Vega-Hernandez, M., Bosch-Bayard, J., Melie-Garcia, L., and Canales-Rodriguez, E. (2005). Estimating brain functional connectivity with sparse multivariate autoregression. Philosophical Transactions of the Royal Society, B 360, 969–981.

Wu, C. O., Chiang, C. T., and Hoover, D. R. (1998). Asymptotic confidence regions for kernel smoothing of a varying-coefficient model with longitudinal data. Journal of the American Statistical Association 93, 1388–1402.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67, 301–320.