
High dimensional regression for regenerative time-series: an application to road traffic modeling

Mohammed Bouchouia and François Portier

Télécom Paris, Institut Polytechnique de Paris

Abstract

This paper investigates statistical models for road traffic modeling. The proposed methodology considers road traffic as (i) a high-dimensional time series for which (ii) regeneration occurs at the end of each day. Because of (ii), prediction is based on a daily modeling of the road traffic using a vector autoregressive model that combines linearly the past observations of the day. To handle (i), the learning algorithm follows from an ℓ1-penalization of the regression coefficients. Excess risk bounds are established in the high-dimensional framework in which the number of road sections goes to infinity with the number of observed days. On floating car data observed in an urban area, the approach is compared to state-of-the-art methods including neural networks. Besides being very competitive in terms of prediction, it makes it possible to identify the most determinant sections of the road network.

Keywords: vector autoregressive model, Lasso, regenerative process, road traffic prediction

MSC 2010 subject classifications: Primary 62J05, 62J07; secondary 62P30.

1 Introduction

Due to the side effects of traffic congestion, including, for instance, pollution and economically ineffective transportation (Bull, 2003; Pellicer et al., 2013; Harrison and Donnelly, 2011), achieving smart mobility has become one of the leading challenges of emerging cities (Washburn et al., 2009). To set up effective solutions, such as developing intelligent transportation management systems for urban planners or extending the road network efficiently, smart cities need to understand road traffic precisely. This paper investigates interpretable predictive models estimated from floating car data, making it possible to identify the determinants of road traffic.

The proposed methodology addresses two important issues related to floating car data. First, a particular point contrasting with traditional time series analysis (Brockwell et al., 1991) is that the number of vehicles using the network almost vanishes during each night. Hence, in terms of probabilistic dependency, the road traffic between different days is assumed to be independent. Such a phenomenon is referred to as "regeneration" in the Markov chains literature (Meyn and Tweedie, 2012), and we say that the road network "regenerates" at the beginning of each new day. Second, the size of the road network, especially in urban areas, can be relatively large compared to the number of observed days. This implies that the algorithms employed must be robust to the well-known high-dimensional regression setting (Giraud, 2014), in which the number of features is large.

For any time t ∈ {0, . . . , T} of day i ∈ {1, . . . , n}, denote by W_t^{(i)} ∈ ℝ^p the vector of speeds registered in the road network. Hence p, n and T + 1 stand for the number of sections in the network, the number of days in the study, and the number of time instants within each day. Inspired by time series analysis, the proposed model is similar to the popular vector autoregressive (VAR) model, as described in the econometric literature (Hamilton, 1994), with the difference that it only applies within each day due to the regeneration property. We therefore consider the following linear regression model, called regenerative VAR,

L(W_t^{(i)}) = b_t + A W_t^{(i)},  for each t ∈ {0, . . . , T − 1}, i ∈ {1, . . . , n},  (1)


with A ∈ ℝ^{p×p} and b_t ∈ ℝ^p. This model shall be used to predict the next value W_{t+1}^{(i)}. The parameter A encodes the influence between different road sections and the parameters (b_t)_{t=0,...,T−1} account for the daily (seasonal) variations.

The approach taken for estimating the parameters of the regenerative VAR, (b_t)_{t=0,...,T−1} and A, follows from applying ordinary least squares (OLS) while penalizing the coefficients of A using the ℓ1-norm, just as in the popular LASSO procedure (Tibshirani, 1996); see also Bühlmann and Van De Geer (2011); Giraud (2014); Hastie et al. (2015) for reference textbooks. The estimator is computed by minimizing over (b_t)_{t=0,...,T−1} ∈ ℝ^{p×T} and A ∈ ℝ^{p×p} the objective function

∑_{i=1}^n ∑_{t=0}^{T−1} ‖W_{t+1}^{(i)} − b_t − A W_t^{(i)}‖_F^2 + 2λ‖A‖_1,  (2)

where ‖A‖_1 = ∑_{1≤k,ℓ≤p} |A_{k,ℓ}| and ‖·‖_F stands for the Frobenius norm. Whereas the estimation of standard VAR models (without regeneration) has been well understood for decades (Brockwell et al., 1991), penalization has been introduced only recently to estimate the model coefficients; see, among others, Valdés-Sosa et al. (2005); Wang et al. (2007); Haufe et al. (2010); Song and Bickel (2011); Michailidis and d'Alché-Buc (2013); Kock and Callot (2015); Basu et al. (2015); Baek et al. (2017). These references advocate the use of the LASSO or some of its variants in time-series prediction when the dimension of the time series is relatively large. Other approaches achieving selection in VAR models without using the LASSO are proposed in Davis et al. (2016).
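Once the intercepts b_t are removed by centering per time instant (see Section 3.1), the objective (2) decouples across the rows of A. A minimal sketch of this row-by-row computation is given below, assuming speeds stored in a numpy array W of shape (n, p, T+1); the function name and array layout are our choices, and sklearn's alpha matches λ only up to the sample-size normalization used in its loss.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_regenerative_var(W, lam):
    n, p, T1 = W.shape
    X, Y = W[:, :, :-1], W[:, :, 1:]      # inputs W_t and outputs W_{t+1}
    # Centering per time instant removes the intercepts b_t.
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    # Pool the n*T (input, output) pairs into a design of shape (n*T, p).
    Xmat = Xc.transpose(0, 2, 1).reshape(-1, p)
    Ymat = Yc.transpose(0, 2, 1).reshape(-1, p)
    A = np.zeros((p, p))
    for k in range(p):                    # one LASSO per road section
        A[k] = Lasso(alpha=lam, fit_intercept=False).fit(Xmat, Ymat[:, k]).coef_
    # Recover the intercepts: b_t = mean(W_{t+1}) - A mean(W_t).
    b = Y.mean(axis=0) - A @ X.mean(axis=0)
    return b, A
```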

Because of the regeneration property, the mathematical tools used in this paper differ from those of the time-series literature quoted above. The independence between the days allows us to rely on standard results dealing with sums of independent random variables and therefore to build upon results on the LASSO with independent data (Bickel et al., 2009). From an estimation point of view, two key aspects are related to the regenerative VAR model: (i) the regression output is a high-dimensional vector of size p; (ii) the model is linear with random covariates. This last point makes our study related to random design analyses of the LASSO (Bunea et al., 2007; Van De Geer and Bühlmann, 2009). An additional issue related to the regenerative VAR model is that of model switching. Model switching occurs when the use of two different matrices A, each within its own time period, improves prediction. This kind of variation in the predictive model might simply be caused by changes in road traffic intensity. Hence, a key question related to the regenerative VAR model is the detection of such change points based on observed data.

From a theoretical standpoint, we adopt an asymptotic framework which captures the nature of usual road traffic data, in which p and n grow to infinity (p might be larger than n) whereas T is considered fixed. We first establish a bound on the predictive risk, defined as the normalized prediction error, for the case λ = 0. This situation corresponds to ordinary least squares. The order is p/n and therefore deteriorates as p gets large. Moreover, the conditions for its validity are rather strong because they require the smallest eigenvalue of the Gram matrix to be sufficiently large. Then, under a sparsity assumption on the matrix A, stating that each line of A has a small number of non-zero coefficients (each section is predicted based on a small number of sections), we study the regularized case λ > 0. In this case, even when p is much larger than n, we obtain a bound of order 1/n (up to a logarithmic factor), and the eigenvalue condition is alleviated as it concerns only the eigenvalues restricted to the active variables. These results are finally used to demonstrate the consistency of a cross-validation change point detection procedure under regime switching.

From a practical perspective, the regenerative VAR, by virtue of its simplicity, contrasts with past approaches, mostly based on deep neural networks, that have been used to handle road traffic data. For instance, in Lv et al. (2014), autoencoders are used to extract spatial and temporal information from the data before prediction. In Dai et al. (2017), multilayer perceptrons (MLP) and long short-term memory (LSTM) networks are combined to analyze time-varying trends in the data. Finally, LSTMs with spatial graph convolutions are employed in Lv et al. (2018); Li et al. (2017). We now point out three advantages of the proposed approach:

(i) It makes it possible to consider very large road networks, typically including not only the main roads but also the primary roads. The previous neural network methods either use data collected from fixed sensors on the main roads (Dai et al., 2017; Lv et al., 2014) or, when using floating car data, restrict the network to the main roads, ignoring primary roads (Epelbaum et al., 2017). Even if this prevents overfitting, as it avoids a large number of features compared to the sample size, there might be some loss of information in reducing the data to an arbitrary subset.

(ii) The estimated coefficients are easily interpretable thanks to the linear model and the LASSO selection procedure, which shrinks irrelevant sections to zero. This last point provides data-driven graphical representations of the dependency within the network that could be useful for road maintenance. Once again, this contrasts with complex deep learning models, in which interpretation is known to be difficult.

(iii) Changes in the distribution of road traffic during the day can be handled easily using a regime switching approach. It consists in a simple extension of the initial regenerative VAR proposed in (1), in which the matrix A is allowed to change over time.

Figure 1: (a) Rennes road network is colored in blue and 3 representative sections are colored in green, black and red; (b) speeds observed during 6 days of December 2018 in the green section (cyclic behavior); (c) in the black section (high speed section); (d) in the red section (low speed section); (e) average speed over all sections in the road network during the same 6 days of December 2018.

To demonstrate the practical interest of the proposal, the data used concern the urban area of a French city, Rennes, and consist of n = 144 days, p = 556 road sections and T + 1 = 20 time instants (from 3pm to 8pm) within each day (see Figure 1). Among all the considered methods, including classical baselines from time series analysis as well as recent neural network architectures, the regime switching model has the best performance.

The outline is as follows. In Section 2, the probabilistic framework is introduced. The optimal linear predictor is characterized, and the main assumptions are discussed. In Section 3, we present the main theoretical results of the paper, namely bounds on the prediction error of (2). Section 4 investigates the regime switching variant. A comparative study including different methods applied to the real data presented before is proposed in Section 5. A simulation study is conducted in Section 6. All the proofs of the stated results are gathered in the Appendix.

2 Probabilistic framework

Let S = {1, . . . , p} denote an index set of p ≥ 1 real-valued variables that evolve during a time period T = {0, . . . , T}. These variables are gathered in a matrix of size p × (T + 1) denoted by W = (W_{k,t})_{k∈S, t∈T}.


In road traffic data, k ∈ S stands for the section index of the road network S, t ∈ T for the time instants within the day, and W_{k,t} is the speed recorded at section k and time t. We denote by W_{·,t} ∈ ℝ^p the vector of speeds at time t ∈ T. The following probabilistic framework will be adopted throughout the paper.

Assumption 1 (probabilistic framework). Let (Ω, F, P) be a probability space. The matrix W = (W_{·,0}, . . . , W_{·,T}) is a random element valued in ℝ^{p×(T+1)} with distribution P. For all k ∈ S and t ∈ T, we have E[W_{k,t}^2] < +∞ (E stands for the expectation with respect to P).

The task of interest is to predict the state variable at time t, W_{·,t}, using the information available at time t − 1, W_{·,t−1}. The model used to predict W_{·,t} is given by the mapping L(W_{·,t−1}) = b_t + A W_{·,t−1}, where A ∈ ℝ^{p×p} and b_t ∈ ℝ^p are parameters to be estimated. Define

Y = (W_{·,1}, . . . , W_{·,T}) and X = (W_{·,0}, . . . , W_{·,T−1}).

Equipped with this notation, the problem can be expressed as a simple matrix-regression problem with covariate X ∈ ℝ^{p×T} and output Y ∈ ℝ^{p×T}. In matrix notation, the model class simply writes

L = {L(X) = b + AX : b ∈ ℝ^{p×T}, A ∈ ℝ^{p×p}},  (3)

with L((X_{·,1}, . . . , X_{·,T})) = (L(X_{·,1}), . . . , L(X_{·,T})). The number of parameters needed to describe a single element of L is p(p + T). Because the matrix A is fixed over the period T, this model reflects a certain structural belief about the dependence along time of the state variable W. Each element of L is defined by a baseline matrix b ∈ ℝ^{p×T}, which carries the average behavior of the whole network, and a matrix A ∈ ℝ^{p×p}, which encodes the influence between the different road sections. Alternative models are discussed at the end of the section. The accuracy of a given model L ∈ L is measured through the (normalized) L2(P)-risk given by

R(L) = p^{−1} E[‖Y − L(X)‖_F^2].

Note that, because of the factor 1/p, R(L) is just the averaged squared error of L. Among the class L, there is an optimal predictor characterized by the usual normal equations. This is the statement of the next proposition.

Proposition 1. Suppose that Assumption 1 is fulfilled. There is a unique minimizer L⋆ ∈ argmin_{L∈L} R(L). Moreover, L = L⋆ if and only if E[Y − L(X)] = 0 and E[(Y − L(X)) X^T] = 0.

The following decomposition of the risk underlines the prediction loss associated to a predictor L by comparing it with the best predictor L⋆ in L. Define the excess risk by

E(L) = R(L) − R(L⋆), L ∈ L.

Proposition 2. Suppose that Assumption 1 is fulfilled. It holds that

∀L ∈ L, E(L) = p^{−1} E[‖L(X) − L⋆(X)‖_F^2].

In what follows, bounds will be established on the excess risk in an asymptotic regime where both n and p := p_n go to infinity whereas T remains fixed. Though theoretical, this regime is of practical interest as it makes it possible to analyze cases where the number of sections p is relatively large (possibly greater than n). This implies that the sequence of random variables of interest must be introduced as a triangular array. For each n ≥ 1, we observe n realizations of W: W^{(1)}, . . . , W^{(n)}, where, for each i, W^{(i)} ∈ ℝ^{p×(T+1)}. The following modeling assumption will be the basis to derive theoretical guarantees on E(L̂).

Assumption 2 (daily regeneration). For each n ≥ 1, (W^{(i)})_{1≤i≤n} is an independent and identically distributed collection of random variables defined on (Ω, F, P).

The daily regeneration assumption has two components: the independence between the days and the probability distribution of each day, which remains unchanged. The independence is in accordance with the practical use of the road network: at night only a few people use the network, so it can regenerate and hence forget its past. The fact that each day has the same distribution means essentially that only days of similar types are gathered in the data. For instance, this assumption might not hold when mixing weekend days and workdays. This assumption also implies that the network structure remains the same during data collection.


Empirical check of the daily regeneration assumption To check the daily regeneration assumption on the real data presented in the introduction, we extract a set of p(T + 1) = 11120 one-dimensional time series. Each of these is associated to a specific instant t in the day and a specific section k, i.e., each time series is given by (W_{k,t}^{(i)})_{i=1,...,n} using the introduced notation. For each series, we run the test presented in Broock et al. (1996), for which the null hypothesis is that the time series is independent and identically distributed. For most sections, 88%, the null hypothesis is not rejected at a nominal level of 95%. This supports the daily regeneration assumption. The share of sections, 12%, for which the hypothesis is rejected can be explained by the presence of missing values in the data, which have been replaced by the average value (see Section 5 for details). It is also likely that external events such as roadworks and constructions have impacted the result of the test.
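The test of Broock et al. (1996) is the BDS test; a minimal sketch of the per-series check is given below, assuming its statsmodels implementation and a speed array W of shape (n, p, T+1) (the array name and layout are our choices, not the authors' code).

```python
import numpy as np
from statsmodels.tsa.stattools import bds

def share_not_rejected(W, level=0.05):
    n, p, T1 = W.shape
    rejected = 0
    for k in range(p):
        for t in range(T1):
            series = W[:, k, t]                  # (W_{k,t}^{(i)})_{i=1..n}
            _, pvalue = bds(series, max_dim=2)
            rejected += float(np.min(np.atleast_1d(pvalue))) < level
    return 1 - rejected / (p * T1)               # share of i.i.d.-compatible series
```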

Alternative models Two alternative modeling approaches might have been considered at the price of additional notation and minor changes in the proofs of the results presented in the next section. The first is obtained by imposing the matrix A to be diagonal. In this case, each one-dimensional coordinate is fitted using an autoregressive model based on one single lag. The second is to use more than one lag, say H > 1, to predict the next coming instance. This is done by enlarging the matrix A to (A_1, . . . , A_H), where each A_h ∈ ℝ^{p×p}, and by stacking in X the H previous lags. These types of variations, though interesting, are not presented in the paper for the sake of readability. Another alternative modeling approach, which will be addressed in Section 4, consists in allowing the matrix A to change across time and detecting those changes from the data.

3 Empirical risk minimization

3.1 Definitions and first results

Given a sequence (W^{(i)})_{1≤i≤n} ⊂ ℝ^{p×(T+1)} and a regularization parameter λ ≥ 0, define the estimate

(b̂, Â) ∈ argmin_{b ∈ ℝ^{p×T}, A ∈ ℝ^{p×p}} ( ∑_{i=1}^n ‖Y^{(i)} − b − A X^{(i)}‖_F^2 ) + 2λ‖A‖_1,  (4)

with

Y^{(i)} = (W_{·,1}^{(i)}, . . . , W_{·,T}^{(i)}) and X^{(i)} = (W_{·,0}^{(i)}, . . . , W_{·,T−1}^{(i)}).

The prediction at point X ∈ ℝ^{p×T} is given by L̂(X) = b̂ + ÂX. Like the intercept in classical regression, the matrix parameter b is only a centering term. Indeed, when minimizing (4) with respect to b only, we find b̂ = n^{−1} ∑_{i=1}^n (Y^{(i)} − Â X^{(i)}). The value of Â can thus be obtained by solving the least-squares problem (4) without intercept and with empirically centered variables. Given Â, the matrix b̂ can then be recovered using this simple formula. Consequently, the prediction at point X may be written as L̂(X) = Ȳ_n + Â(X − X̄_n), where, generically, M̄_n = n^{−1} ∑_{i=1}^n M^{(i)}. Similarly, one has L⋆(X) = E[Y] + A⋆(X − E[X]). The previous two expressions emphasize that the excess risk can be decomposed into two terms: one dealing with the estimation error on A⋆ and one relative to the error on the averages E[Y] and E[X]. We now state this decomposition.

Proposition 3. Suppose that Assumption 1 is fulfilled. It holds that

E(L̂) ≤ p^{−1}‖(A⋆ − Â)Σ^{1/2}‖_F^2 + 2‖E[Y] − Ȳ_n‖_F^2 + 2‖Â(E[X] − X̄_n)‖_F^2,

where Σ = E[(X − E[X])(X − E[X])^T].

The previous proposition will be the key to control the excess risk of both the OLS and LASSO predictors. The OLS (resp. LASSO) estimate, which results from (4) when λ = 0 (resp. λ > 0), is denoted by (b̂^{(ols)}, Â^{(ols)}) (resp. (b̂^{(lasso)}, Â^{(lasso)})), and its associated predictor is given by L̂^{(ols)} (resp. L̂^{(lasso)}).


3.2 Ordinary least-squares

In this section, we study the case of the OLS (i.e., when λ = 0). This will put into perspective the main results of the paper, concerning the LASSO (i.e., when λ > 0), that will be given in the next section. We now introduce certain assumptions dealing with the distribution of W^{(i)}, which depends on n through p = p_n. The following one states that the covariates are uniformly bounded.

Assumption 3 (bounded variables). With probability 1,

lim sup_{n→∞} max_{k∈S, t∈T} |X_{k,t}| < ∞.

The following invertibility condition on Σ can be seen as an identification condition, since the matrix A⋆ is unique under this hypothesis. Denote by γ ≥ 0 the smallest eigenvalue of Σ.

Assumption 4. lim inf_{n→∞} γ > 0.

Finally, introduce the noise level σ² > 0, which is a bound on the conditional variance of the residual matrix

ε = Y − L⋆(X),

given the covariates X. Formally, σ² is the smallest positive real number such that, with probability 1, max_{k∈S, t∈T} E[ε_{k,t}^2 | X_{·,t}] ≤ σ². Note that σ² might depend on p. The fact that it does not depend on X stresses the homoscedasticity of the regression model.

Assumption 5. lim inf_{n→∞} σ > 0.

We stress that both previous assumptions on γ and σ might be alleviated at the price of additional technical details in our statements, which can easily be deduced from our proofs. The following result provides a bound on the excess risk associated to the OLS procedure. The given bound depends explicitly on the quantities of interest n and p as well as on the underlying probabilistic model through A⋆. The asymptotic framework we consider is with respect to n → ∞, and we allow the dimension p to go to infinity (with a certain restriction). A discussion is provided below the proposition.

Proposition 4. Suppose that Assumptions 1, 2, 3, 4 and 5 are fulfilled. If n → ∞ and p = p_n → ∞ such that p log(p)/n → 0, we have

E(L̂^{(ols)}) = O_P( (p σ² + ‖A⋆‖_F^2 / p) / n ).

The previous bound on the excess risk is badly affected by the parameter p. First of all, the condition p log(p)/n → 0 implies that p ≪ n. Second, the number of nonzero coefficients in A⋆ directly influences the bound. When A⋆ = I, the second term equals 1/n and becomes negligible with respect to p/n. In contrast, when A⋆ = (1)_{k,ℓ}, i.e., all the covariates are used to predict each output, it equals p/n and becomes a leading term. In between, we have the situation where each line of A⋆ possesses only a few non-zero coefficients; then the magnitude of this term becomes 1/n. In this case, additional benefits can be obtained from the use of the LASSO approach (λ > 0), as detailed in the next section.

3.3 Regularized least-squares

The LASSO approach is introduced to overcome the "large p, small n" difficulties of the OLS previously discussed. In contrast with the OLS, the additional ‖·‖_1-penalty term enforces the estimated matrix Â^{(lasso)} to be "sparse", i.e., to have only a few non-zero coefficients. This makes it possible to reduce the number of parameters to estimate (especially when n is small compared to p) and therefore to take advantage of any sparsity structure in the matrix A⋆ associated to L⋆.

Introduce the active set S⋆_k as the set of non-zero coefficients of the k-th line of A⋆, i.e., for each k ∈ {1, . . . , p},

S⋆_k = {ℓ ∈ {1, . . . , p} : A⋆_{k,ℓ} ≠ 0}.

As stated in the following assumption, the sparsity level of each line of A⋆ is assumed to be bounded uniformly over n.


Assumption 6. lim sup_{n→∞} max_{k∈S} |S⋆_k| < ∞.

For any vector u ∈ ℝ^p and any integer q ≥ 1, the ℓ_q-norm is defined by ‖u‖_q^q = ∑_{k=1}^p |u_k|^q. The following assumption is a relaxation of Assumption 4, which was needed in the study of the OLS approach. For any subset S ⊂ {1, . . . , p}, denote by S̄ its complement, and introduce the cone

C(S, α) = {u ∈ ℝ^p : ‖u_{S̄}‖_1 ≤ α‖u_S‖_1}.

Denote by γ⋆ the largest nonnegative number such that, for all k ∈ {1, . . . , p},

‖Σ^{1/2} u‖_2^2 ≥ γ⋆ ‖u‖_2^2, ∀u ∈ C(S⋆_k, 3).

Assumption 7. lim inf_{n→∞} γ⋆ > 0.

Similar to Assumption 4, the previous assumption can be alleviated by allowing the value γ⋆ to go to 0 at a certain rate, which can be deduced from our proof. We are now in a position to give an upper bound on the excess risk for the LASSO model.

Proposition 5. Suppose that Assumptions 1, 2, 3, 5, 6 and 7 are fulfilled. If n → ∞ and p = p_n → ∞ such that log(p)/n → 0, we have

E(L̂^{(lasso)}) = O_P( log(p) / n ),

provided that λ = C√(n σ² log(p)) for some constant C > 0.

The parameter p, which badly influences the bound on the OLS excess risk, is here replaced by log(p). This shows that, without any knowledge of the active variables, the LASSO approach recovers the accuracy (at the price of a logarithmic factor) of an "oracle" OLS estimator that would use only the active variables. Another notable advantage is that the assumption for the validity of the bound has been reduced to log(p)/n → 0, compared to the OLS, which needs p log(p)/n → 0.

The LASSO requires choosing the regularization parameter λ, which controls the number of selected covariates and is meant to realize the classical bias-variance trade-off. In the proof, we follow the approach developed in Hastie et al. (2015) by choosing λ as small as possible but larger than a certain empirical sum involving the residuals of the model; see (11) in the Appendix. This explains our choice λ = C√(n σ² log(p)). Note that we could have proceeded differently. Since the Frobenius norm writes as the sum of the ℓ2-norms of the matrix lines, problem (4) might be expressed through p standard LASSO sub-problems, each having T outputs. This allows selecting a different λ in each sub-problem. For the sake of presentation, we prefer to keep the value of λ fixed during the theoretical analysis. In practice, λ will be chosen using cross-validation, as explained in the simulation section; a sketch of the per-section selection is given below.
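A minimal sketch of the per-section sub-problems with a cross-validated λ in each is given below, assuming the pooled, per-time-centered arrays Xmat and Ymat of shape (n·T, p) built as in the sketch after (2); the function name is our choice, and sklearn's alpha scaling differs from λ in (4) by a sample-size factor.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def fit_rows_with_cv(Xmat, Ymat, n_folds=5):
    p = Ymat.shape[1]
    A = np.zeros((p, p))
    lambdas = np.zeros(p)
    for k in range(p):                    # one LASSO sub-problem per section
        model = LassoCV(cv=n_folds, fit_intercept=False).fit(Xmat, Ymat[:, k])
        A[k] = model.coef_
        lambdas[k] = model.alpha_         # λ selected for this sub-problem
    return A, lambdas
```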

4 Regime switching

In our road traffic framework, regime switching occurs when the optimal matrix A⋆, which is used to predict the next speed configuration of the road network, changes at some time t⋆ ∈ {1, . . . , T}. This happens when the distribution of road traffic changes at some time in the day. For instance, it might happen that a certain matrix A⋆ is suitable to model the morning traffic while another matrix is needed to fit the afternoon conveniently. In this framework, the set of predictors writes simply as ∪_{t=1,...,T} L_t, where each L_t is given by

L_t = {b + (A X_{·,1}, . . . , A X_{·,t}, A′ X_{·,t+1}, . . . , A′ X_{·,T}) : b ∈ ℝ^{p×T}, A ∈ ℝ^{p×p}, A′ ∈ ℝ^{p×p}}.

Each submodel L_t corresponds to a regime switch occurring at time t. Note that the case t = T corresponds to the model without regime switching, i.e., a single matrix A is used for the whole day. Within each submodel, the optimal predictor is defined as

L⋆_t = argmin_{L∈L_t} R(L)

and satisfies some normal equations (involving A and A′) that can be recovered by applying Proposition 1 with a specific range for T. The risk associated to a submodel t is then given by R(L⋆_t).


If the change point t⋆ were given, then the situation would be very similar to what has been studied in the previous section, except that two predictors would need to be estimated, one for each time range U⋆ = {1, . . . , t⋆} and V⋆ = {t⋆ + 1, . . . , T}. As each predictor would be computed as

(b̂_{t⋆}, Â_{t⋆}) ∈ argmin_{b ∈ ℝ^{p×t⋆}, A ∈ ℝ^{p×p}} ∑_{i=1}^n ‖Y_{·,U⋆}^{(i)} − b − A X_{·,U⋆}^{(i)}‖_F^2 + 2λ‖A‖_1,

(b̂′_{t⋆}, Â′_{t⋆}) ∈ argmin_{b ∈ ℝ^{p×(T−t⋆)}, A ∈ ℝ^{p×p}} ∑_{i=1}^n ‖Y_{·,V⋆}^{(i)} − b − A X_{·,V⋆}^{(i)}‖_F^2 + 2λ‖A‖_1,

the results obtained in the previous section would apply to both estimates separately, and one would easily derive an excess risk bound when the change point t⋆ is known. The key point here is thus to recover the change point t⋆ by some procedure. Next, we study a model selection approach based on cross-validation to estimate t⋆.

Suppose that T ≥ 2. For each t ∈ {1, . . . , T}, define U_t = {1, . . . , t} and V_t = {t + 1, . . . , T}. Let F = {I_1, . . . , I_K} be a partition of {1, . . . , n} whose elements I_k are called folds. We suppose further that each fold contains the same number of observations, n/K. For each fold I ∈ F, solve both optimization problems

(b̂_{I,t}, Â_{I,t}) ∈ argmin_{b ∈ ℝ^{p×t}, A ∈ ℝ^{p×p}} ∑_{i∈Ī} ‖Y_{·,U_t}^{(i)} − b − A X_{·,U_t}^{(i)}‖_F^2 + 2λ‖A‖_1,

(b̂′_{I,t}, Â′_{I,t}) ∈ argmin_{b ∈ ℝ^{p×(T−t)}, A ∈ ℝ^{p×p}} ∑_{i∈Ī} ‖Y_{·,V_t}^{(i)} − b − A X_{·,V_t}^{(i)}‖_F^2 + 2λ‖A‖_1,

where Ī is the complement of I in {1, . . . , n}. To lighten the notation, the superscript (lasso) will from now on be omitted. The estimate of the risk based on the fold I is defined as

R̂_{I,t} = (K/(np)) ∑_{i∈I} ( ‖Y_{·,U_t}^{(i)} − b̂_{I,t} − Â_{I,t} X_{·,U_t}^{(i)}‖_F^2 + ‖Y_{·,V_t}^{(i)} − b̂′_{I,t} − Â′_{I,t} X_{·,V_t}^{(i)}‖_F^2 ).

The resulting cross-validation estimate of R_t is then the average over the folds I ∈ F of the risks R̂_{I,t}, and the estimated change point t̂ is the one having the smallest estimated risk, i.e.,

t̂ ∈ argmin_{t=1,...,T} R̂_t, where R̂_t := (1/K) ∑_{I∈F} R̂_{I,t}.

This t̂ represents the best instant from which a different (linear) model shall be used. The following proposition provides a rate of convergence for the cross-validation risk estimate.
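A minimal sketch of this search is given below, assuming a speed array W of shape (n, p, T+1) and a hypothetical helper fit_lasso_var(W_train, t_lo, t_hi, lam) that returns a predictor pred(x, s) ≈ b̂_s + Â x fitted on the time indices t_lo ≤ s < t_hi; all names are our choices, not the authors' code.

```python
import numpy as np

def cv_change_point(W, lam, fit_lasso_var, K=5):
    n, p, T1 = W.shape
    T = T1 - 1
    folds = np.array_split(np.arange(n), K)          # days are kept whole
    risk = np.zeros(T + 1)
    for t in range(1, T + 1):                        # candidate switch times
        for I in folds:
            train = np.setdiff1d(np.arange(n), I)    # complement fold
            pred_u = fit_lasso_var(W[train], 0, t, lam)
            pred_v = fit_lasso_var(W[train], t, T, lam) if t < T else None
            err = 0.0
            for i in I:                              # held-out days
                for s in range(T):
                    pred = pred_u if s < t else pred_v
                    err += np.sum((W[i, :, s + 1] - pred(W[i, :, s], s)) ** 2)
            risk[t] += err / (n * p)                 # matches K/(np), then 1/K
    t_hat = int(np.argmin(risk[1:]) + 1)
    return t_hat, risk[1:]
```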

Proposition 6. Under the assumptions of Proposition 5, we have, for all t ∈ {1, . . . , T − 1},

|R̂_t − R(L⋆_t)| = O_P( √(log(p)/n) ).

The difficulty in proving the previous result is to suitably control the large dimension in the decomposition of the error. This is done by relying on a bound on the ℓ1-error associated to the matrix estimate Â_n^{(lasso)}; see Proposition 7 in the Appendix.

Applying the previous proposition allows us to establish the consistency of the cross-validation detection procedure.

Corollary 1. Under the assumptions of Proposition 5, suppose there exists t⋆ ∈ {1, . . . , T − 1} such that lim sup_{n→∞} R(L⋆_{t⋆}) < lim inf_{n→∞} R(L⋆_t) for all t ≠ t⋆; then

P(t̂ = t⋆) → 1.

Interestingly, the previous regime switching detection can be iterated in a greedy procedure in which, at each step, we decide, using cross-validation as before, whether an additional regime switch is beneficial. If it is, we continue; if it is not, we stop. Such an iterative algorithm is greedy and difficult to study, as it only tests for one change point at each iteration, whereas the truth that one wishes to discover potentially involves many change points.


5 Real data analysis

In this section, we conduct a comparative study of different methods. We first present the real-world data used for this purpose. Then, we describe the different methods in competition and analyze the obtained results. Finally, we show graphical interpretations of the proposed LASSO approach. A GitHub repository is available at https://github.com/mohammedLamine/RS-Lasso.

5.1 Dataset

The initial dataset contains the speed (km/h) and the localization of each car using the Coyote navigation system (floating car data). Data was collected on the Rennes road network every 30 seconds, over the period from the 1st of December 2018 until the 9th of July 2019. Signals were received from 113,577 vehicles.

Some pre-processing has been done to correct sensor errors and structure the data in a convenient format. First, the received locations are GPS coordinates, whereas the model introduced here is based on road sections. Accordingly, we map-match the locations to the OpenStreetMap road network dataset (Greenfeld, 2002), so that we get the section of each car location. Second, to obtain data representing the flow in the sections of the network, we aggregate and average the observed speed values of cars over 15-minute intervals for each section. Third, another issue is that, for some sections and time intervals, no value is observed. We impute these values using the historical average associated to the time interval and section. To lower the number of missing values, we focus on a particularly active time period, from 3pm to 8pm (local time zone), on workdays. It is worth noting that this dataset comes from a low-frequency collection, meaning that, for a given time and section, the average number of observations on the selected sections of the network is about 6 logs per 15 minutes per section. This is due to the fact that we only observe a portion of the vehicles in the network (those equipped with Coyote GPS devices), and that suburban sections have low traffic. Thus, going for a finer resolution (less than 15 minutes) would increase the number of missing values. A sketch of the aggregation and imputation steps is given below.
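A minimal sketch of these steps, assuming the raw logs are in a pandas DataFrame with columns day, section, slot (the 15-minute interval index) and speed; column and function names are ours, not the authors'.

```python
import pandas as pd

def build_speed_table(logs: pd.DataFrame) -> pd.DataFrame:
    # Average the raw 30-second logs over each (day, section, 15-min slot).
    agg = logs.groupby(["day", "section", "slot"])["speed"].mean()
    # Reindex so missing (day, section, slot) cells appear as NaN:
    # rows are days, columns are (section, slot) pairs.
    full = agg.unstack(["section", "slot"])
    # Impute each missing cell with the historical average of its
    # (section, slot) pair, i.e., the column mean over days.
    return full.fillna(full.mean(axis=0))
```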

In summary, we consider only the 15-minute time intervals in the period from 3pm to 8pm (local time zone) of workdays. The resulting dataset corresponds to our matrices W^{(i)}, i = 1, . . . , n, where p = 556, n = 144 and T + 1 = 20 (Figure 1). We divide the data into 3 subsets: train 63% (1710 examples), validation 27% (741 examples), test 10% (285 examples). The splits were made such that days are not cut in the middle and all subsets contain sequential full days of data.

5.2 Comparative study

5.2.1 Methods

The methods used in this study can be divided into two groups, reflecting the information used to predict. Methods of the first group, called baselines, predict the speed of a given section using only data about this section. Methods of the second group may predict using data collected from the whole network. This includes the linear predictors that have been studied previously, and also certain neural network predictors that we introduce for the sake of completeness.

Baselines

• Historical average (HA): for each section k and time interval t, the prediction is given by the average speed at time interval t and section k. The average speed is computed on the training dataset.

• Previous observation (PO): for each section k and time interval t, the prediction is given by the observed speed at t − 1 and section k.

• Autoregressive (AR): for each section k and time interval t, the prediction is given by a linear combination of the t_0 previously observed speeds, at times t − 1, . . . , t − t_0, in section k. The coefficients are estimated using ordinary least squares on the training dataset. In our study, we use the implementation from the "Statsmodels" package and vary the order t_0 from 1 to 5; a sketch is given after this list.
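A minimal sketch of the AR baseline for one section, assuming statsmodels' AutoReg on the day-concatenated training series; the function name and default order are our choices.

```python
import numpy as np
from statsmodels.tsa.ar_model import AutoReg

def fit_ar_baseline(train_series: np.ndarray, t0: int = 3):
    # train_series: 1-d array of speeds for one section over the training days.
    model = AutoReg(train_series, lags=t0, trend="c").fit()
    return model   # model.predict(start, end) yields one-step-ahead forecasts
```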

9

Page 10: arxiv.org · High dimensional regression for regenerative time-series: an application to road tra c modeling Mohammed Bouchouia And Fran˘cois Portier T el ecom Paris, Institut Polytechnique

Linear predictors

• Ordinary least squares (OLS): we first compute the linear predictor for each section separately. This corresponds to solving p sub-problems independently using OLS. Then we stack the solutions (β_k)_{1≤k≤p} ⊂ ℝ^p to form the solution A^T = (β_1, . . . , β_p) ∈ ℝ^{p×p} of the main problem.

• LASSO: we compute the LASSO predictor the same way as the OLS, by solving p sub-problems independently. For each sub-problem, we select the regularization coefficient λ using the 5-fold cross-validation from the sklearn package on our training data. Taking the same value of λ in all sub-problems (as is the case in the theoretical analysis) does not impact the results.

• Time-specific LASSO (TS-LASSO): we subdivide the problem into smaller time-specific problems that we solve separately. The model finds a different coefficient matrix A for each time t ∈ {1, . . . , T}. We compute each of these predictors in the same way as the LASSO.

• Regime switching LASSO (RS-LASSO): this approach, introduced in Section 4, consists in searching for the best time instant t from which another matrix A should be used to predict. For each t ∈ {1, . . . , T}, we run a first LASSO procedure over the time frame {1, . . . , t} and another one over {t + 1, . . . , T}. For each t, the resulting risk is estimated using cross-validation and, finally, we pick t̂ as the time change having the smallest estimated risk.

• Group LASSO (Grp-LASSO): the Grp-LASSO is similar to the LASSO but uses a different penalization function, which involves the norm of each column of the coefficient matrix A. This approach filters out sections that are found to be irrelevant to predicting the complete network, and hence is useful when the same sparsity structure is shared among the different sections. A sketch is given after this list.
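A minimal sketch of the column-wise group penalty, assuming sklearn's MultiTaskLasso: when all p sections are predicted jointly, its ℓ2/ℓ1 penalty over input features matches a penalty on the columns of A, zeroing out an input section everywhere at once; Xmat and Ymat are the pooled centered arrays of the earlier sketches.

```python
from sklearn.linear_model import MultiTaskLasso

def fit_group_lasso(Xmat, Ymat, lam):
    # coef_ has shape (p_outputs, p_features); a feature (input section)
    # whose whole coefficient column is zero is filtered out of the network.
    model = MultiTaskLasso(alpha=lam, fit_intercept=False).fit(Xmat, Ymat)
    return model.coef_
```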

Neural network predictors

• Multilayer perceptron (MLP), made of 2 fully connected hidden layers. Data is normalized at the entry of each hidden layer. All model parameters are regularized using an ℓ1 regularization.

• Fully connected long short-term memory (FC-LSTM), composed of one LSTM layer followed by a fully connected layer (output layer), with 2 time steps. The model parameters are regularized using an ℓ1 regularization (note that we do not regularize the recurrent step in the LSTM).

These neural networks use the hyperbolic tangent as activation function, except for the output layer, for which the identity function is used. The input data values are scaled so that they belong to [−1, 1]. We train our models using the ADAM optimizer and the mean squared error loss. Because of the high dimensionality of the model (p = 556), all non-regularized approaches overfit the training dataset. To address this problem, we rely on a combination of 2 regularization techniques:

• Early stopping (Caruana et al., 2001) is used to stop model training when the model starts overfitting.

• ℓ1-regularization is conducted by adding a penalization term to the loss function that enforces a sparse structure on the parameters.

We stress that other configurations have been tried without improving the results: (a) using the sigmoid, the ReLU and its variants as activation functions; (b) log-scaling the input data and scaling to the interval [0, 1] rather than [−1, 1]; (c) varying the number of layers, neurons and lags; and (d) other regularization techniques. The models are built using the Keras package with a TensorFlow backend in Python. A sketch of the MLP configuration is given below.
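A minimal sketch of the MLP configuration, assuming the Keras API described above; the layer width and penalty weight are illustrative guesses, not the authors' tuned values.

```python
from tensorflow.keras import Sequential, layers, regularizers, callbacks

def build_mlp(p=556, hidden=256, l1=1e-5):
    # Two tanh hidden layers, normalization at the entry of each hidden
    # layer, l1 penalty on all weights, identity output, ADAM + MSE.
    model = Sequential([
        layers.BatchNormalization(input_shape=(p,)),
        layers.Dense(hidden, activation="tanh",
                     kernel_regularizer=regularizers.l1(l1)),
        layers.BatchNormalization(),
        layers.Dense(hidden, activation="tanh",
                     kernel_regularizer=regularizers.l1(l1)),
        layers.Dense(p, activation="linear",
                     kernel_regularizer=regularizers.l1(l1)),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

stop = callbacks.EarlyStopping(patience=10, restore_best_weights=True)
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val), callbacks=[stop])
```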

5.2.2 Results

The methods are compared using the mean squared error (MSE) and the mean absolute error (MAE), both computed on the test set. We examine these errors over all the sections and over some specific sections; over the whole time period and focusing on more specific periods, such as particular days.


Performance on the whole road network

Model   AR 1    AR 3    AR 5    FC-2LSTM  HA      LASSO   MLP     OLS     PO      RS-LASSO  TS-LASSO  Grp-LASSO
MAE     7.25    7.23    7.23    7.21      7.72    7.14    7.26    9.20    9.38    7.13      7.34      7.30
MSE     113.85  113.22  113.22  111.44    133.41  109.24  113.57  162.79  183.38  109.04    115.47    113.81

Table 1: MSE and MAE computed over the whole network and the whole time period.

Figure 2: MSE and MAE computed (a) for each day and (b) for each specific hour.

Table 1 shows the performance of all models over the whole network. The simple OLS model does not generalize well; together with the PO prediction, it gives the worst performance. The OLS is subject to an overfitting problem, as its MAE equals 9.20 on the test set versus 5.69 on the train set. The other baselines' performances are similar, with AR giving a slightly better performance than HA. Note that increasing the number of lags does not lead to better results. The neural networks give results close to those of the AR, with the LSTM performing slightly better than the MLP model. The different LASSO approaches also gave variable results; the model with regime switching (RS-LASSO, with time switch at 6:15pm) performs better than all other models (7.13 MAE) and is just slightly better than the LASSO (−0.01 MAE). Using a different LASSO predictor for each time (TS-LASSO) does not improve the error and gives a worse performance than the LASSO (+0.20 MAE). The same observation on Grp-LASSO suggests that entirely filtering out some sections causes a certain loss.

In Figure 2, we present (a) the error computed for each specific day and (b) the error for each specific hour (averaged over the days). For readability, we only present a subset of the models. The figure shows the good performance of LASSO and RS-LASSO over days and time, and confirms that they outperform all other models. The two approaches alternate in performance over time: the RS-LASSO performs better during the first and last 1.5 hours, and is worse than the LASSO in the middle hours. As this cut-off happens around the RS-LASSO switch (6:15pm), it might be that the two RS-LASSO parts (before and after the switch) focus on modeling the regime of their corresponding periods and overlook the middle period that is separated by the time switch.

Performance on highly variable sections

To further understand the results, we compare the models on highly variable sections. Variable sections are selected based on a clustering algorithm applied to descriptive statistics (min, max, mean, standard deviation, quartiles) computed for each section at each time. The cluster represents about 30% of the sections.


Model   AR 1    AR 3    AR 5    FC-2LSTM  HA      LASSO   MLP     OLS     PO      RS-LASSO  TS-LASSO  Grp-LASSO
MAE     9.27    9.26    9.27    9.11      10.18   9.01    9.17    11.50   11.81   8.94      9.20      9.17
MSE     172.59  171.94  171.98  163.45    207.18  161.29  165.39  238.91  274.92  159.66    168.01    165.28

Table 2: MSE and MAE restricted to highly variable sections.

Figure 3: The prediction error for each method (a) over days, (b) over time.

Table 2 shows the performance over highly variable sections only. All model performances deteriorate compared to the performance on the full road network. Similar observations as before can be made on these results, with the exception that both neural networks, TS-LASSO and Grp-LASSO now perform better than the autoregressive models (−0.2 MAE). The performance gap between the models is wider. The RS-LASSO improves more clearly on the LASSO (−0.07 MAE) than on the whole network. The alternation of performance over time between the LASSO and RS-LASSO models is still observable.

Comparison of different penalties

We now compare the performance of the LASSO, Ridge and Elastic Net penalties, defined as:

• LASSO (ℓ1) penalty: w ↦ λ‖w‖_1.

• Ridge (ℓ2) penalty: w ↦ λ‖w‖_2^2.

• Elastic Net penalty: w ↦ λ(α‖w‖_1 + (1 − α)‖w‖_2^2),

where λ is the regularization parameter and α ∈ [0.1, 0.9] represents a ratio between the LASSO and Ridge penalties. For each method, λ and α are determined through 5-fold cross-validation. For Elastic Net, the best α was around 0.6, giving the ℓ1-penalty a slightly higher weight. A sketch of the cross-validated comparison is given below.
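A minimal sketch of this comparison on one section, assuming the pooled centered arrays of the earlier sketches; sklearn's CV estimators select λ (and, for Elastic Net, the ratio α via l1_ratio) on the training data.

```python
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV

def compare_penalties(Xmat, yk, n_folds=5):
    # yk: pooled centered outputs for one section (shape (n*T,)).
    fits = {
        "lasso": LassoCV(cv=n_folds, fit_intercept=False),
        "ridge": RidgeCV(fit_intercept=False),
        "elastic net": ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],
                                    cv=n_folds, fit_intercept=False),
    }
    return {name: est.fit(Xmat, yk) for name, est in fits.items()}
```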

Penalty   OLS (no penalty)  LASSO   Ridge   Elastic Net
MSE       162.83            105.89  110.78  108.48
MAE       9.22              7.00    7.15    7.09

Table 3: MSE and MAE computed over the whole network and the whole time period for the LASSO, Ridge and Elastic Net penalties.

Table 3 shows the MAE and MSE obtained for all the models. The LASSO penalty shows a slightly better performance than the other two penalties. This can be explained by the ability of the LASSO to effectively select features for the model, which is increasingly important in a sparse high-dimensional setup.


Thus, choosing the ℓ1-penalty for this problem seems to be efficient and also gives a better interpretation of the model's parameters.

5.3 Graphical visualization of the LASSO model

In this section, the aim is to show that the LASSO coefficients are interpretable and can be used to capture certain features of the architecture of the road network.

The LASSO coefficients are indicators of the traffic flow dependencies between the sections of the road network. For a section of interest, say k, a way to visualize these connections is to draw, on top of the network map, arcs going from section k to the sections for which the corresponding regression coefficient is non-zero (those belonging to the active set S⋆_k). In addition, the color intensity and the width of each arc can be scaled by the value of the corresponding coefficient in order to highlight the influence of one section on another.

The previous visualization is provided for several representative sections in Figure 4. In Figure 4, (a, b) represent sections from the Rennes ring roads. These roads have an upper speed limit of 70 km/h and circle the whole city of Rennes. We see that these sections connect strongly to adjacent sections in the ring and also to the closest highways. This is expected, as these sections are physically connected to each other (they form a ring road) and have links to adjacent highways. We also observe this kind of behavior on the highways (Figure 4.c), where the sections of the highway have a higher dependence on adjacent sections of the highway. In Figure 4, (d, e) show sections that are part of link roads. We see that these sections have a high number of connections to adjacent ring roads. These link roads usually connect the flow between relatively distant parts of the city, so we also observe farther connections around the ring network. We observe similar aspects for sections from the city center, as shown in Figure 4, (f).

Another meaningful visualization provided by the model consists in representing the influence of each section on the overall network using the values of the estimated coefficients. For section ℓ, the criterion is given by

C_ℓ = ∑_{1≤k≤p : Â^{(lasso)}_{k,ℓ} > 0} Â^{(lasso)}_{k,ℓ}.

Taking negative coefficients into account would capture an inverse relationship between sections, meaning that when one section gets congested, another gets less traffic. This may happen when there are incidents on the road, but it does not characterize the average influence between sections. Thus we only consider positive coefficients in building the criterion (93% of the coefficients).
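A minimal sketch of the criterion, assuming Â is the estimated (p, p) LASSO matrix with rows indexing predicted sections and columns indexing predictor sections:

```python
import numpy as np

def influence_criterion(A: np.ndarray) -> np.ndarray:
    # C[l] sums, over predicted sections k, the positive coefficients A[k, l],
    # measuring how strongly section l positively drives the rest of the network.
    return np.where(A > 0, A, 0.0).sum(axis=0)
```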

Figure 5 is a heatmap of the road network where each section is colored according to the influence coefficient C_ℓ. We see that the network behavior is mainly influenced by ring road sections and highways. This is expected, as the main traffic flows on such roads, while most sections in the city center, and those in rural and suburban areas, carry a smaller part of the traffic.

Figure 5: Heatmap plot of the sections based on the influence criterion C_ℓ.

6 Simulation study

In this section, we simulate data exhibiting a behavior similar to that of the real dataset examined in the previous section. After a brief description of the data generation process, we use these data to validate our regime switching proposal and compare it to other baseline algorithms.

6.1 Data generation process

Model parameters

The intercept sequence (b_t)_{t=1,...,T} is defined as follows. For each t ∈ {1, . . . , T} and k ∈ {1, . . . , p}, (b_t)_k = −(2.5² − (t − 17.5)²) + ε_{t,k}, where the ε_{t,k} are Gaussian white noise. The matrix A is constructed as follows.

• Generate an Erdős–Rényi graph (Erdős and Rényi, 1959) with an average of 8 connections per node, resulting in a sparse structure representing the support of our initial matrix A.

• Generate the nonzero entries of the matrix A independently, according to a uniform distribution on [−1, 1].

• Normalize the matrix so that each line has an ℓ2-norm equal to 1.


Figure 4: For each graph, arcs are drawn from the section of interest to the active sections deduced from the estimated coefficient matrix Â^{(lasso)}. Color intensity and width are scaled according to the values of the matrix coefficients. Different types of sections are considered: (a) and (b) ring roads; (c) highway; (d, e) link roads; (f) city center.

This last step helps to obtain reasonable dynamics for the sequences of interest, whose generation is described below.

Generation of the sequence

Let b be an intercept matrix and let A and A′ be two matrices constructed following the previous algorithm. For each day i ∈ {1, . . . , n}, generate an initial speed vector W_0^{(i)} of size p as a mixture of 3 Gaussian distributions (representing 3 different classes of speed) given by

0.25 N(S_1, 0.1 S_1/2) + 0.5 N(S_2, 0.1 S_2/2) + 0.25 N(S_3, 0.1 S_3/2),


with S_1 = 45, S_2 = 72 and S_3 = 117. Generate (ε_t^{(i)})_{t=1,...,T} ⊂ ℝ^p as an independent and identically distributed sequence with distribution N(0, I_p). The data is then obtained through

W_{t+1}^{(i)} = b_t + A (W_t^{(i)} − E[W_t^{(i)}]) + ε_t^{(i)},  t = 1, . . . , 11,

W_{t+1}^{(i)} = b_t + A′ (W_t^{(i)} − E[W_t^{(i)}]) + ε_t^{(i)},  t = 12, . . . , 18.

A code sketch of the whole generator is given below.
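A minimal sketch of the generator, assuming networkx/numpy; the reading of the mixture (one speed class drawn per section) and any unstated constants are our guesses, not the authors' exact choices.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def make_matrix(p, avg_deg=8):
    # Erdos-Renyi support with about avg_deg connections per node, uniform
    # nonzero entries on [-1, 1], and rows normalized to unit l2-norm.
    g = nx.gnp_random_graph(p, avg_deg / (p - 1), seed=1)
    A = np.zeros((p, p))
    for k, l in g.edges:
        A[k, l], A[l, k] = rng.uniform(-1, 1, size=2)
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return np.divide(A, norms, out=np.zeros_like(A), where=norms > 0)

def init_speeds(p):
    # 3-Gaussian mixture of speed classes, N(S, 0.1*S/2), drawn per section.
    S = np.array([45.0, 72.0, 117.0])
    c = rng.choice(3, size=p, p=[0.25, 0.5, 0.25])
    return rng.normal(S[c], 0.1 * S[c] / 2)

def simulate_day(b, A, A2, mu, switch=11):
    # b: (p, T) intercepts; mu[:, t] plays the role of E[W_t]; the regime
    # switches from A to A2 after `switch`, as in the displays above.
    p, T = b.shape
    W = np.empty((p, T + 1))
    W[:, 0] = init_speeds(p)
    for t in range(T):
        M = A if t < switch else A2
        W[:, t + 1] = b[:, t] + M @ (W[:, t] - mu[:, t]) + rng.standard_normal(p)
    return W
```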

6.2 Results

The RS-LASSO approach introduced in Section 4 follows from a search for the best instant t from which a different (linear) model shall be used. This search is conducted by comparing the estimated risks R̂_t, t = 1, . . . , T, where t is the instant from which a second model is estimated. Figure 6 represents the evolution of the estimated risks for a given dataset. The best change point is well identified, as the procedure gives t̂ = 11.

Model   OLS    MLP    FC-2LSTM  AR 1   AR 3   AR 5   HA     PO      LASSO  Grp-LASSO  RS-LASSO  TS-LASSO
MSE     18.94  16.13  17.43     52.55  45.11  44.9   60.32  110.68  13.97  14.81      1.13      5.30
MAE     3.24   2.96   3.13      5.65   5.21   5.2    6.04   7.95    2.75   2.85       0.85      1.75

Table 4: MSE and MAE computed over the whole network and the whole time period.

In Table 4, we compare the RS-LASSO to the other methods based on M = 20 independent datasets (the MSE and MAE have been averaged over the datasets). The RS-LASSO model obtains the best performance and clearly outperforms the other methods. The fact that RS-LASSO succeeds in identifying the exact switching time allows it to build very good predictors for both time windows, before and after the switch. Nonlinear models such as the MLP and LSTM obtained slightly lower performance than the LASSO and slightly better than the OLS. The baseline methods HA, PO and AR gave the worst performance on the simulated data, as on the real data. To summarize, we distinguish four classes reflecting the levels of performance. Baseline methods have the worst results. The one-model methods (LASSO, OLS, Grp-LASSO, MLP, LSTM) give reasonable results. The multiple-model method TS-LASSO performs better than the one-model methods. Finally, the regime switching method performs best by fitting only two models to the data.

For further investigation of the RS-LASSO model, we compare the estimated matrices Â and Â′ to the ones used for data generation. Table 5 shows the Frobenius distance (FD) ‖A − Â‖_F + ‖A′ − Â′‖_F along with the support recovery rates (SR), defined as the percentage of agreement in the non-zero coefficients (active sets) between the estimated matrix and the truth.


Figure 6: Evolution of the estimated risks

              FD(RS-LASSO)  FD(LASSO)   SR(RS-LASSO)  SR(LASSO)
Left model    1.592872      19.755836   96.49%        79.11%
Right model   1.301384      18.232622   96.37%        79.22%

Table 5: Frobenius distance and support recovery rate for LASSO and RS-LASSO.

In contrast with the LASSO, we observe a very small Frobenius error for the RS-LASSO. At the same time, the support recovery of the RS-LASSO is over 96%, compared to around 79% for the LASSO. This confirms the good performance of the RS-LASSO.

7 Conclusion

We proposed in this paper a methodology to model high-dimensional time series for which regeneration occurs at the end of each day. In a high-dimensional setting, we established bounds on the prediction error associated to linear predictors. More precisely, we showed that the regularized least-squares approach decreases the prediction error bound compared to ordinary least squares. In addition, a regime switching detection procedure has been proposed and its consistency has been established. Through simulations and using floating car data observed in an urban area, the proposed approach has been compared to several other methods. Further research might be conducted on the extension of the proposed approach to general Markov chains. In this case, the regeneration times are unobserved and occur randomly, involving a completely different probabilistic setting than the one adopted in this paper. Whereas regeneration properties are known to be powerful theoretical tools to study estimates based on Markov chains (see, e.g., Bertail and Portier (2019) for a recent account), their use in a prediction framework seems not to have been investigated.

Acknowledgment The authors are grateful to Maxime Bourliatoux for helpful discussions on an earlier version of this work. This work was financially supported by the FUI 22 GEOLYTICS project. The authors would like to thank IT4PME and Coyote for providing data and for their support.


References

Baek, C., Davis, R. A., and Pipiras, V. (2017). Sparse seasonal and periodic vector autoregressive modeling. Computational Statistics & Data Analysis, 106:103–126.

Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models. The Annals of Statistics, 43(4):1535–1567.

Bertail, P. and Portier, F. (2019). Rademacher complexity for Markov chains: Applications to kernel smoothing and Metropolis–Hastings. Bernoulli, 25(4B):3912–3938.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.

Brockwell, P. J., Davis, R. A., and Fienberg, S. E. (1991). Time Series: Theory and Methods. Springer Science & Business Media.

Broock, W. A., Scheinkman, J. A., Dechert, W. D., and LeBaron, B. (1996). A test for independence based on the correlation dimension. Econometric Reviews, 15(3):197–235.

Bühlmann, P. and Van De Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.

Bull, A. (2003). Traffic Congestion: The Problem and How to Deal with It. United Nations, Economic Commission for Latin America and the Caribbean.

Bunea, F., Tsybakov, A., and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169–194.

Caruana, R., Lawrence, S., and Giles, C. L. (2001). Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408.

Dai, X., Fu, R., Lin, Y., Li, L., and Wang, F.-Y. (2017). DeepTrend: A deep hierarchical neural network for traffic flow prediction. arXiv preprint arXiv:1707.03213.

Davis, R. A., Zang, P., and Zheng, T. (2016). Sparse vector autoregressive modeling. Journal of Computational and Graphical Statistics, 25(4):1077–1096.

Epelbaum, T., Gamboa, F., Loubes, J.-M., and Martin, J. (2017). Deep learning applied to road traffic speed forecasting. arXiv preprint arXiv:1710.08266.

Erdős, P. and Rényi, A. (1959). On random graphs I. Publicationes Mathematicae Debrecen, 6:290–297.

Giraud, C. (2014). Introduction to High-Dimensional Statistics, volume 138. CRC Press.

Greenfeld, J. S. (2002). Matching GPS observations to locations on a digital map. In 81st Annual Meeting of the Transportation Research Board, volume 1, pages 164–173, Washington, DC.

Hamilton, J. D. (1994). Time Series Analysis, volume 2. Princeton University Press, Princeton, NJ.

Harrison, C. and Donnelly, I. A. (2011). A theory of smart cities. In Proceedings of the 55th Annual Meeting of the ISSS-2011, Hull, UK, volume 55.

Hastie, T., Tibshirani, R., and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC.

Haufe, S., Müller, K.-R., Nolte, G., and Krämer, N. (2010). Sparse causal discovery in multivariate time series. In Causality: Objectives and Assessment, pages 97–106.

Kock, A. B. and Callot, L. (2015). Oracle inequalities for high dimensional vector autoregressions. Journal of Econometrics, 186(2):325–344.


Li, Y., Yu, R., Shahabi, C., and Liu, Y. (2017). Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926.

Lv, Y., Duan, Y., Kang, W., Li, Z., and Wang, F.-Y. (2014). Traffic flow prediction with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation Systems, 16(2):865–873.

Lv, Z., Xu, J., Zheng, K., Yin, H., Zhao, P., and Zhou, X. (2018). LC-RNN: A deep learning model for traffic speed prediction. In IJCAI, pages 3470–3476.

Meyn, S. P. and Tweedie, R. L. (2012). Markov Chains and Stochastic Stability. Springer Science & Business Media.

Michailidis, G. and d'Alché-Buc, F. (2013). Autoregressive models for gene regulatory network inference: Sparsity, stability and causality issues. Mathematical Biosciences, 246(2):326–334.

Pellicer, S., Santa, G., Bleda, A. L., Maestre, R., Jara, A. J., and Skarmeta, A. G. (2013). A global perspective of smart cities: A survey. In 2013 Seventh International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, pages 439–444. IEEE.

Song, S. and Bickel, P. J. (2011). Large vector auto regressions. arXiv preprint arXiv:1106.3915.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.

Valdés-Sosa, P. A., Sánchez-Bornot, J. M., Lage-Castellanos, A., Vega-Hernández, M., Bosch-Bayard, J., Melié-García, L., and Canales-Rodríguez, E. (2005). Estimating brain functional connectivity with sparse multivariate autoregression. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1457):969–981.

Van De Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392.

Wang, H., Li, G., and Tsai, C.-L. (2007). Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(1):63–78.

Washburn, D., Sindhu, U., Balaouras, S., Dines, R. A., Hayes, N., and Nelson, L. E. (2009). Helping CIOs understand smart city initiatives. Growth, 17(2):1–17.

Before starting the proofs, we specify some useful notations. For any matrix $M \in \mathbb R^{p\times T}$, the $\ell_1$-norm is denoted by $\|M\|_1 = \sum_{k=1}^p\sum_{t=1}^T|M_{k,t}|$ and the $\ell_\infty$-norm is given by $\|M\|_\infty = \max_{k=1,\dots,p,\ t=1,\dots,T}|M_{k,t}|$. The $t$-th column (resp. $k$-th line) of $M$ is denoted by $M_{\cdot,t} \in \mathbb R^p$ (resp. $M_{k,\cdot} \in \mathbb R^T$). Both are column vectors.

Appendix A Proofs of the stated results

A.1 Proof of Proposition 1

Denote by $L^2(P)$ the Hilbert space composed of the random elements $W$ defined on $(\Omega, \mathcal F, P)$ such that $E\|W\|_F^2 < +\infty$, the underlying scalar product between $X$ and $Y$ being $\mathrm{tr}(E[X^TY])$. Hence $\inf_{L\in\mathcal L}E[\|Y - L(X)\|_F^2]$ is a distance between the element $Y$ of $L^2(P)$ and the linear subspace of $L^2(P)$ made of the $L(X)$, $L \in \mathcal L$. The Hilbert projection theorem ensures that the infimum is uniquely achieved, and that $L^\star(X)$ is the argument of the minimum if and only if $\mathrm{tr}(E[(Y - L^\star(X))^TL(X)]) = 0$ for all $L \in \mathcal L$. Taking $A = 0$ in $L(X) = b + AX$ gives the first set of normal equations. We conclude by taking $b = 0$ and noting that $\mathrm{tr}(E[(Y - L^\star(X))^TAX]) = \mathrm{tr}(E[X(Y - L^\star(X))^T]A) = 0$ for all $A \in \mathbb R^{p\times p}$ is equivalent to $E[(Y - L^\star(X))X^T] = 0$. This is the second set of normal equations.
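As a quick numerical illustration (ours, not part of the original argument), the normal equations can be checked on simulated data: the least-squares residuals are empirically uncorrelated with the predictors. A minimal sketch with toy dimensions:

```python
# Check E[(Y - L*(X)) X^T] = 0 for the empirical least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
A_true = rng.normal(size=(p, p))
Y = X @ A_true.T + rng.normal(size=(n, p))    # linear model with b = 0

coef, *_ = np.linalg.lstsq(X, Y, rcond=None)  # minimises ||Y - X coef||_F
residuals = Y - X @ coef
print(np.abs(residuals.T @ X / n).max())      # close to 0, up to sampling noise
```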


A.2 Proof of Proposition 2

The statement follows from developing $pR(L) = E[\|(Y - L^\star(X)) + (L^\star(X) - L(X))\|_F^2]$ and then using that $\mathrm{tr}(E[(Y - L^\star(X))(L(X) - L^\star(X))^T]) = 0$, which is a consequence of the normal equations given in Proposition 1.

A.2.1 Proof of Proposition 3

Let $X_c = X - E[X]$. Recall that $L(X) = b + AX$ and $L^\star(X) = E[Y] + A^\star X_c$. In virtue of Proposition 2, we have
\[ p\,\mathcal E(L) = E_X[\|(A^\star - A)X_c + E(Y) - AE(X) - b\|_F^2], \]
where $E_X$ stands for the expectation with respect to $X$ only. Because $X_c$ has mean $0$, we get
\[ p\,\mathcal E(L) = E_X[\|(A^\star - A)X_c\|_F^2] + \|E(Y) - AE(X) - b\|_F^2. \]
Finally, since $E_X[\|(A^\star - A)X_c\|_F^2] = \|(A^\star - A)\Sigma^{1/2}\|_F^2$, and using the triangle inequality, we find
\[ p\,\mathcal E(\hat L) \le \|(A^\star - \hat A)\Sigma^{1/2}\|_F^2 + 2\|E(Y) - \bar Y_n\|_F^2 + 2\|\hat A(E(X) - \bar X_n)\|_F^2. \]

A.2.2 Proof of Proposition 4

The inequality of Proposition 3 applied to the OLS estimate gives
\[ p\,\mathcal E(\hat L^{(ols)}) \le \|(A^\star - \hat A^{(ols)})\Sigma^{1/2}\|_F^2 + 2\|\bar Y_n - E(Y)\|_F^2 + 2\|\hat A^{(ols)}\Sigma^{1/2}\bar Z_n\|_F^2. \]
Using that $\|\hat A^{(ols)}\Sigma^{1/2}\bar Z_n\|_F \le \|(\hat A^{(ols)} - A^\star)\Sigma^{1/2}\|_F\|\bar Z_n\|_F + \|A^\star\Sigma^{1/2}\bar Z_n\|_F$, we get
\[ p\,\mathcal E(\hat L^{(ols)}) \le \|(A^\star - \hat A^{(ols)})\Sigma^{1/2}\|_F^2(1 + 4\|\bar Z_n\|_F^2) + 2\|\bar Y_n - E(Y)\|_F^2 + 4\|A^\star\Sigma^{1/2}\bar Z_n\|_F^2. \]
Let $\Gamma$ denote the greatest eigenvalue of $\Sigma$. By Assumption 3, we have $\limsup_{n\to\infty}\Gamma < \infty$. Applying Lemma 9, we deduce that
\begin{align*}
p\,\mathcal E(\hat L^{(ols)}) &= \|(A^\star - \hat A^{(ols)})\Sigma^{1/2}\|_F^2\Big(1 + O_P\Big(\frac pn\Big)\Big) + O_P\Big(\frac{p + \|A^\star\Sigma^{1/2}\|_F^2}{n}\Big)\\
&= \|(A^\star - \hat A^{(ols)})\Sigma^{1/2}\|_F^2\,O_P(1) + O_P\Big(\frac{p + \Gamma\|A^\star\|_F^2}{n}\Big), \tag{5}
\end{align*}
where we just used that $p/n \to 0$ and $\|A^\star\Sigma^{1/2}\|_F^2 = \sum_{k=1}^p\|A^{\star T}_{k,\cdot}\Sigma^{1/2}\|_2^2 \le \Gamma\|A^\star\|_F^2$. Hence the proof of Proposition 4 will be complete if it is shown that
\[ \|(\hat A^{(ols)} - A^\star)\Sigma^{1/2}\|_F^2 = O_P\Big(\frac{p^2\sigma^2}{n}\Big). \]
By definition of $\hat A^{(ols)}$, it holds that
\[ \sum_{i=1}^n\|Y_c^{(i)} - \hat A^{(ols)}\Sigma^{1/2}Z_c^{(i)}\|_F^2 \le \sum_{i=1}^n\|Y_c^{(i)} - A^\star\Sigma^{1/2}Z_c^{(i)}\|_F^2, \]
where $Z_c^{(i)} = \Sigma^{-1/2}(X^{(i)} - \bar X_n)$ and $Y_c^{(i)} = Y^{(i)} - \bar Y_n$.


Let $\Delta^{(ols)} = (A^\star - \hat A^{(ols)})\Sigma^{1/2}$. Developing the squared norm in the left-hand side and making use of the centering property, we obtain
\[ \sum_{i=1}^n\|\Delta^{(ols)}Z_c^{(i)}\|_F^2 \le 2\Big|\sum_{i=1}^n\langle Y_c^{(i)} - A^\star\Sigma^{1/2}Z_c^{(i)},\,\Delta^{(ols)}Z_c^{(i)}\rangle_F\Big| = 2\Big|\sum_{i=1}^n\langle\varepsilon^{(i)} - \bar\varepsilon_n,\,\Delta^{(ols)}Z_c^{(i)}\rangle_F\Big|, \tag{6} \]
with
\[ \varepsilon^{(i)} = Y^{(i)} - L^\star(X^{(i)}). \]
We now provide a lower bound for the left-hand side of (6). Define
\[ \hat\Pi = \Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2}, \qquad \hat\Sigma = n^{-1}\sum_{i=1}^n(X^{(i)} - \bar X_n)(X^{(i)} - \bar X_n)^T. \]
We have $n^{-1}\sum_{i=1}^n\|\Delta^{(ols)}Z_c^{(i)}\|_F^2 = \mathrm{tr}(\Delta^{(ols)}\hat\Pi\Delta^{(ols)T})$, and because $\hat\Pi$ is symmetric we can write $\hat\Pi = UDU^T$ with $U^TU = I_p$ ($I_p$ is the identity matrix of size $p\times p$) and $D = \mathrm{diag}(d_1,\dots,d_p)$. Hence, defining $\tilde\Delta^{(ols)} = (\tilde\Delta^{(ols)}_{\cdot,1},\dots,\tilde\Delta^{(ols)}_{\cdot,p}) = \Delta^{(ols)}U$ and denoting by $\hat\gamma_n = \min_{k=1,\dots,p}d_k$ the smallest eigenvalue of $\hat\Pi$, it holds that
\[ n^{-1}\sum_{i=1}^n\|\Delta^{(ols)}Z_c^{(i)}\|_F^2 = \mathrm{tr}\Big(\sum_{k=1}^pd_k\tilde\Delta^{(ols)}_{\cdot,k}\tilde\Delta^{(ols)T}_{\cdot,k}\Big) \ge \hat\gamma_n\sum_{k=1}^p\|\tilde\Delta^{(ols)}_{\cdot,k}\|_2^2 = \hat\gamma_n\|\Delta^{(ols)}\|_F^2. \]
For the right-hand side of (6), we provide the following upper bound. Using the Cauchy–Schwarz inequality, we have
\[ \Big|\sum_{i=1}^n\langle\varepsilon^{(i)} - \bar\varepsilon_n,\,\Delta^{(ols)}Z_c^{(i)}\rangle\Big| = \Big|\Big\langle\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)Z_c^{(i)T},\,\Delta^{(ols)}\Big\rangle\Big| \le \Big\|\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)Z_c^{(i)T}\Big\|_F\|\Delta^{(ols)}\|_F. \]
Together with (6), the two previous bounds imply that
\[ \hat\gamma_n\|\Delta^{(ols)}\|_F \le 2\Big\|n^{-1}\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)Z_c^{(i)T}\Big\|_F. \]
Note that, for all $u \in \mathbb R^p$ with $\|u\|_2 = 1$,
\[ u^T\hat\Pi u = 1 + u^T(\hat\Pi - I)u \ge 1 - \|\hat\Pi - I\|, \]
where $\|\cdot\|$ stands for the spectral norm. Hence we have that
\[ \big(1 - \|I - \hat\Pi\|\big)\|\Delta^{(ols)}\|_F \le 2\Big\|n^{-1}\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)Z_c^{(i)T}\Big\|_F. \]
Because the term in the right-hand side is $O_P(p\sigma/\sqrt n)$ in virtue of Lemma 10, we just have to show that $\|\hat\Pi - I\| \to 0$ in probability to conclude the proof. For that purpose we use Lemma 11 with $B$ given by the following inequality (which holds true in virtue of Assumptions 3 and 4):
\[ \mathrm{tr}\big(X_c^T\Sigma^{-1}X_c\big) = \sum_{t=1}^T(X_{\cdot,t} - E(X_{\cdot,t}))^T\Sigma^{-1}(X_{\cdot,t} - E(X_{\cdot,t})) \le \sum_{t=1}^T\gamma^{-1}\|X_{\cdot,t} - E(X_{\cdot,t})\|_2^2 \le TpU^2/\gamma, \]
where, almost surely, $U = \limsup_{n\to\infty}\max_{k\in S,\,t\in\mathcal T}|X_{k,t} - E(X_{k,t})|$. Because $p\log(p)/(n\gamma) \to 0$, the upper bound given in Lemma 11 goes to $0$ and it holds that $\|\hat\Pi - I\| \to 0$ in probability.


A.2.3 Proof of Proposition 5

Define $\Delta = A^\star - \hat A^{(lasso)}$. A decomposition similar to (5) is valid for the LASSO. We have
\[ p\,\mathcal E(\hat L^{(lasso)}) \le \|\Delta\Sigma^{1/2}\|_F^2\,O_P(1) + O_P\Big(\frac{p + \Gamma\|A^\star\|_F^2}{n}\Big), \]
implying that the proof reduces to showing that $\|\Delta\Sigma^{1/2}\|_F^2 = O_P((p/n)\log(p))$. To this aim, we will provide a bound on $\|\Sigma^{1/2}\Delta_{k,\cdot}\|_2$ for each value of $k = 1,\dots,p$. The minimization of the LASSO problem can be done by solving $p$ minimization problems, each of which gives one of the lines of the estimated matrix $\hat A^{(lasso)} = (\hat A^{(lasso)}_{1,\cdot},\dots,\hat A^{(lasso)}_{p,\cdot})^T$. More formally, we have
\[ \hat A^{(lasso)}_{k,\cdot} \in \operatorname*{argmin}_{\beta\in\mathbb R^p}\ \sum_{i=1}^n\|(Y_k^{(i)} - \bar Y_k^n) - (X^{(i)} - \bar X_n)^T\beta\|_2^2 + 2\lambda\|\beta\|_1. \]
Denote by $s^\star_\infty = \limsup_{n\to\infty}\max_{k\in S}|S^\star_k|$ and let $U$ be such that, almost surely,
\[ \limsup_{n\to\infty}\max_{k\in S,\,t\in\mathcal T}|X_{k,t} - E(X_{k,t})| \le U \quad\text{and}\quad \limsup_{n\to\infty}\max_{k\in S,\,t\in\mathcal T}|\varepsilon_{k,t}| \le U. \]
Such a $U$ exists in virtue of Assumptions 3 and 6. Set
\[ \lambda = 4\sqrt{T^2U^2\sigma^2n\log(12p^3)}. \]
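In passing, this row-by-row decomposition also describes how the estimator can be computed in practice; a minimal sketch (the stacked-transitions layout and the mapping from the paper's $\lambda$ to sklearn's `alpha` are our own conventions):

```python
# Row-by-row LASSO estimation of the coefficient matrix A.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_var_rows(X, Y, lam):
    """X, Y: arrays of shape (n_samples, p); returns A_hat of shape (p, p)."""
    Xc = X - X.mean(axis=0)              # centered predictors (X - X_bar)
    Yc = Y - Y.mean(axis=0)              # centered responses (Y - Y_bar)
    n, p = X.shape
    A_hat = np.empty((Yc.shape[1], p))
    for k in range(Yc.shape[1]):         # one l1-penalised problem per line k
        model = Lasso(alpha=lam / n, fit_intercept=False)
        A_hat[k] = model.fit(Xc, Yc[:, k]).coef_
    return A_hat
```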

Intermediary statements that are useful in the subsequent development are now claimed, and will be proved at the end of the proof. Let $0 < \delta < 1$. For $n$ large enough, we have with probability $1 - \delta/2$, for all $k \in \{1,\dots,p\}$,
\[ n\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2 \le \lambda\big(3\|\Delta_{k,S^\star_k}\|_1 - \|\Delta_{k,\bar S^\star_k}\|_1\big). \tag{7} \]
In particular, $\Delta_{k,\cdot} \in C(3, S^\star_k)$. Moreover, for $n$ large enough, with probability $1 - \delta/2$,
\[ 16\|\hat\Sigma - \Sigma\|_\infty(s^\star_\infty/\gamma^\star) \le 1/2. \tag{8} \]
In the following, we assume that (7) and (8) are satisfied. This occurs with probability $1 - \delta$. From (7), we have $\|\Delta_{k,\cdot}\|_1 = \|\Delta_{k,S^\star_k}\|_1 + \|\Delta_{k,\bar S^\star_k}\|_1 \le 4\|\Delta_{k,S^\star_k}\|_1$. Then, using the Jensen inequality, Assumption 7, and finally (8), it follows that
\begin{align*}
\big|\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2 - \|\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2\big| &= |\Delta_{k,\cdot}^T(\hat\Sigma - \Sigma)\Delta_{k,\cdot}|\\
&\le \|\hat\Sigma - \Sigma\|_\infty\|\Delta_{k,\cdot}\|_1^2\\
&\le 16\|\hat\Sigma - \Sigma\|_\infty\|\Delta_{k,S^\star_k}\|_1^2\\
&\le 16\|\hat\Sigma - \Sigma\|_\infty|S^\star_k|\,\|\Delta_{k,S^\star_k}\|_2^2\\
&\le 16\|\hat\Sigma - \Sigma\|_\infty|S^\star_k|\,\|\Delta_{k,\cdot}\|_2^2\\
&\le 16\|\hat\Sigma - \Sigma\|_\infty(|S^\star_k|/\gamma^\star)\|\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2\\
&\le \|\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2/2.
\end{align*}
In particular,
\[ \|\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2 \le 2\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2. \tag{9} \]
Invoking Assumption 7, we find
\[ \|\Delta_{k,\cdot}\|_2^2 \le \|\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2/\gamma^\star \le 2\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2/\gamma^\star, \]
which in turn gives
\[ \|\Delta_{k,S^\star_k}\|_1 \le \sqrt{|S^\star_k|}\,\|\Delta_{k,S^\star_k}\|_2 \le \sqrt{|S^\star_k|}\,\|\Delta_{k,\cdot}\|_2 \le \sqrt{2(|S^\star_k|/\gamma^\star)}\,\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2. \tag{10} \]


Injecting this into the right-hand side of (7), we get
\[ n\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2 \le 3\lambda\sqrt{\frac{2|S^\star_k|}{\gamma^\star}}\,\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2, \]
which implies that
\[ n\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2 \le 3\lambda\sqrt{\frac{2|S^\star_k|}{\gamma^\star}}. \]
We conclude the proof using (9) to obtain
\[ \|\Delta\Sigma^{1/2}\|_F^2 = \sum_{k=1}^p\|\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2 \le \frac{36\lambda^2}{n^2\gamma^\star}\sum_{k=1}^p|S^\star_k| = C\,\frac{p\sigma^2\log(p)}{n}, \]
for some $C > 0$ that does not depend on $(n, p)$, and recalling that the previous happens with probability $1 - \delta$ with $\delta$ arbitrarily small.

Proof of Equations (7) and (8). Since $\log(p)/n \to 0$ and $\liminf_{p\to\infty}\sigma^2 > 0$, we have, for $n$ large enough, $n \ge 4(U^2/\sigma^2)\log(12p^2/\delta) \ge \log(8p^2/\delta)$. Note that the last inequality implies that we can apply Lemmas 12 and 13.

Proof of Equation (7). Recall that $\lambda = 4\sqrt{T^2U^2\sigma^2n\log(12p^3)}$. Note that for $p$ large enough,
\[ \lambda \ge 4\sqrt{T^2U^2\sigma^2n\log(12p^2/\delta)}. \]
It follows from Lemma 12 that, with probability $1 - \delta/2$,
\[ \lambda \ge 2\Big\|\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)(X^{(i)} - \bar X_n)^T\Big\|_\infty. \tag{11} \]

Let $k \in \{1,\dots,p\}$. Note that, because of (11), it holds that
\[ \lambda \ge 2\Big\|\sum_{i=1}^n(\varepsilon^{(i)}_{k,\cdot} - \bar\varepsilon^n_{k,\cdot})(X^{(i)} - \bar X_n)^T\Big\|_\infty. \]
We have
\[ \frac12\sum_{i=1}^n\|(X^{(i)} - \bar X_n)^T\Delta_{k,\cdot}\|_2^2 \le \Big|\sum_{i=1}^n\langle\varepsilon^{(i)}_{k,\cdot} - \bar\varepsilon^n_{k,\cdot},\,(X^{(i)} - \bar X_n)^T\Delta_{k,\cdot}\rangle\Big| + \lambda\big(\|A^\star_{k,\cdot}\|_1 - \|\hat A_{k,\cdot}\|_1\big). \]
First consider the scalar product term of the right-hand side. We have
\begin{align*}
\Big|\sum_{i=1}^n\langle\varepsilon^{(i)}_{k,\cdot} - \bar\varepsilon^n_{k,\cdot},\,(X^{(i)} - \bar X_n)^T\Delta_{k,\cdot}\rangle\Big| &= \Big|\Big\langle\sum_{i=1}^n(X^{(i)} - \bar X_n)(\varepsilon^{(i)}_{k,\cdot} - \bar\varepsilon^n_{k,\cdot}),\,\Delta_{k,\cdot}\Big\rangle\Big|\\
&\le \Big\|\sum_{i=1}^n(\varepsilon^{(i)}_{k,\cdot} - \bar\varepsilon^n_{k,\cdot})(X^{(i)} - \bar X_n)^T\Big\|_\infty\|\Delta_{k,\cdot}\|_1\\
&\le (\lambda/2)\|\Delta_{k,\cdot}\|_1.
\end{align*}


Now we deal with $\|A^\star_{k,\cdot}\|_1 - \|\hat A_{k,\cdot}\|_1$. Note that, by the triangle inequality,
\[ \|\hat A_{k,\cdot}\|_1 = \|A^\star_{k,\cdot} - \Delta_{k,\cdot}\|_1 \ge \big|\|A^\star_{k,\cdot}\|_1 - \|\Delta_{k,\cdot}\|_1\big| \ge \|A^\star_{k,\cdot}\|_1 - \|\Delta_{k,\cdot}\|_1. \]
Using the previous inequality, restricted to the active set $S^\star_k$, we obtain
\begin{align*}
\|A^\star_{k,\cdot}\|_1 - \|\hat A_{k,\cdot}\|_1 &= \|A^\star_{k,\cdot}\|_1 - \|\hat A_{k,S^\star_k}\|_1 - \|\hat A_{k,\bar S^\star_k}\|_1\\
&\le \|A^\star_{k,\cdot}\|_1 - \big(\|A^\star_{k,S^\star_k}\|_1 - \|\Delta_{k,S^\star_k}\|_1\big) - \|\hat A_{k,\bar S^\star_k}\|_1\\
&= \|\Delta_{k,S^\star_k}\|_1 - \|\Delta_{k,\bar S^\star_k}\|_1.
\end{align*}
All this together gives
\[ \frac12\sum_{i=1}^n\|(X^{(i)} - \bar X_n)^T\Delta_{k,\cdot}\|_2^2 \le (\lambda/2)\big(\|\Delta_{k,\cdot}\|_1 + 2\|\Delta_{k,S^\star_k}\|_1 - 2\|\Delta_{k,\bar S^\star_k}\|_1\big) = (\lambda/2)\big(3\|\Delta_{k,S^\star_k}\|_1 - \|\Delta_{k,\bar S^\star_k}\|_1\big), \]
and (7) follows from remarking that
\[ \sum_{i=1}^n\|\Delta_{k,\cdot}^T(X^{(i)} - \bar X_n)\|_2^2 = n\|\hat\Sigma^{1/2}\Delta_{k,\cdot}\|_2^2. \]

Proof of Equation (8). Use that, for $n$ sufficiently large,
\[ \sqrt n \ge 16^2(s^\star_\infty/\gamma^\star)TU^2\sqrt{\log(8p^2/\delta)}, \]
and Lemma 13, which guarantees that $\|\hat\Sigma - \Sigma\|_\infty \le 8TU^2\sqrt{\log(8p^2/\delta)/n}$.

A.3 Proof of Proposition 6

The proof is done under a simplified setting which does not involve any loss of generality. Let $t \in \{1,\dots,T\}$. Since the number of folds $K$ is fixed, we only need to show the convergence of the risk estimate based on each fold, i.e., that $|\hat R_{I,t} - R(L^\star_t)| = O_P(\sqrt{\log(p)/n})$ for $I \in \mathcal F$. Moreover, both quantities $\hat R_{I,t}$ and $R(L^\star_t)$ are made of two similar terms, one for each period $U_t$ and $V_t$. We focus only on the first one, the details for the other being the same. Moreover, since the index $t$ is fixed in the whole proof, we remove it from the notation: we write $\hat b_I$, $\hat A_I$, $U$ instead of $\hat b_{I,t}$, $\hat A_{I,t}$, $U_t$. Define
\[ \delta := \frac{K}{np}\sum_{i\in I}\|Y^{(i)}_{\cdot,U} - \hat b_I - \hat A_IX^{(i)}_{\cdot,U}\|_F^2 - p^{-1}E[\|Y^{(1)}_{\cdot,U} - b^\star - A^\star X^{(1)}_{\cdot,U}\|_F^2], \]
with $(b^\star, A^\star) = \operatorname*{argmin}_{b\in\mathbb R^{p\times t},\,A\in\mathbb R^{p\times p}}E[\|Y^{(1)}_{\cdot,U} - b - AX^{(1)}_{\cdot,U}\|_F^2]$. We need to show that $\delta = O_P(\sqrt{\log(p)/n})$. Write
\begin{align*}
\delta &\le \Big|\frac{K}{np}\sum_{i\in I}\|Y^{(i)}_{\cdot,U} - \hat b_I - \hat A_IX^{(i)}_{\cdot,U}\|_F^2 - p^{-1}E_1[\|Y^{(1)}_{\cdot,U} - \hat b_I - \hat A_IX^{(1)}_{\cdot,U}\|_F^2]\Big|\\
&\quad + p^{-1}\Big(E_1[\|Y^{(1)}_{\cdot,U} - \hat b_I - \hat A_IX^{(1)}_{\cdot,U}\|_F^2] - E[\|Y^{(1)}_{\cdot,U} - b^\star - A^\star X^{(1)}_{\cdot,U}\|_F^2]\Big)\\
&=: \delta_1 + \delta_2,
\end{align*}
where the introduced expectation $E_1$ is taken with respect to $(Y^{(1)}, X^{(1)})$ only. In $\delta_2$, we recover the excess risk of the LASSO computed with $n/K$ observations. Hence, $\delta_2 = O_P(\log(p)/n)$ as indicated by Proposition 5. We now focus on $\delta_1$. Let $L^\star : x \mapsto b^\star + A^\star x$ be the optimal predictor as introduced in


Proposition 1 (with respect to the time period $U$). Put $\hat L_I(x) = \hat b_I + \hat A_Ix$ and $\varepsilon^{(i)}_{\cdot,U} = Y^{(i)}_{\cdot,U} - L^\star(X^{(i)}_{\cdot,U})$. Based on Proposition 1, we deduce that
\[ E_1[\|Y^{(1)}_{\cdot,U} - \hat L_I(X^{(1)}_{\cdot,U})\|_F^2] = E_1[\|\varepsilon^{(1)}_{\cdot,U}\|_F^2] + E_1[\|(\hat L_I - L^\star)(X^{(1)}_{\cdot,U})\|_F^2]. \]
Similarly,
\[ \frac Kn\sum_{i\in I}\|Y^{(i)}_{\cdot,U} - \hat L_I(X^{(i)}_{\cdot,U})\|_F^2 = \frac Kn\sum_{i\in I}\|\varepsilon^{(i)}_{\cdot,U}\|_F^2 - \frac{2K}{n}\sum_{i\in I}\mathrm{tr}\big(\varepsilon^{(i)T}_{\cdot,U}(\hat L_I - L^\star)(X^{(i)}_{\cdot,U})\big) + \frac Kn\sum_{i\in I}\|(\hat L_I - L^\star)(X^{(i)}_{\cdot,U})\|_F^2. \]
Let $\Delta_I = \hat A_I - A^\star$. Note that
\begin{align*}
(\hat L_I - L^\star)(x) &= \hat b_I - b^\star + \Delta_Ix\\
&= \bar Y^{(I)}_{\cdot,U} - \hat A_I\bar X^{(I)}_{\cdot,U} - EY^{(1)}_{\cdot,U} + A^\star EX^{(1)}_{\cdot,U} + \Delta_Ix\\
&= \bar Y^{(I)}_{\cdot,U} - EY^{(1)}_{\cdot,U} - \hat A_I\big(\bar X^{(I)}_{\cdot,U} - EX^{(1)}_{\cdot,U}\big) + \Delta_I\big(x - EX^{(1)}_{\cdot,U}\big)\\
&= \hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X + \Delta_I\big(x - EX^{(1)}_{\cdot,U}\big),
\end{align*}
with $\hat\mu^{(I)}_Y = \bar Y^{(I)}_{\cdot,U} - EY^{(1)}_{\cdot,U}$ and $\hat\mu^{(I)}_X = \bar X^{(I)}_{\cdot,U} - EX^{(1)}_{\cdot,U}$. It holds that
\[ \frac{2K}{n}\sum_{i\in I}\mathrm{tr}\big(\varepsilon^{(i)T}_{\cdot,U}(\hat L_I - L^\star)(X^{(i)}_{\cdot,U})\big) = \frac{2K}{n}\sum_{i\in I}\mathrm{tr}\big((\hat L_I - L^\star)(X^{(i)}_{\cdot,U})\varepsilon^{(i)T}_{\cdot,U}\big) = 2\,\mathrm{tr}\big((\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X)\hat\mu^{(I)}_\varepsilon\big) + 2\,\mathrm{tr}\big(\Delta_I\hat\mu^{(I)}_{\varepsilon X}\big), \]
with $\hat\mu^{(I)}_\varepsilon = (K/n)\sum_{i\in I}\varepsilon^{(i)T}_{\cdot,U}$ and $\hat\mu^{(I)}_{\varepsilon X} = (K/n)\sum_{i\in I}(X^{(i)}_{\cdot,U} - EX^{(1)}_{\cdot,U})\varepsilon^{(i)T}_{\cdot,U}$. In addition, we have
\begin{align*}
\|(\hat L_I - L^\star)(X^{(i)}_{\cdot,U})\|_F^2 - E_1\|(\hat L_I - L^\star)(X^{(1)}_{\cdot,U})\|_F^2 &= 2\,\mathrm{tr}\big((\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X)^T\Delta_I(X^{(i)}_{\cdot,U} - E[X^{(1)}_{\cdot,U}])\big)\\
&\quad + \|\Delta_I(X^{(i)}_{\cdot,U} - E[X^{(1)}_{\cdot,U}])\|_F^2 - E\|\Delta_I(X^{(1)}_{\cdot,U} - E[X^{(1)}_{\cdot,U}])\|_F^2,
\end{align*}
which gives that
\[ \frac Kn\sum_{i\in I}\|(\hat L_I - L^\star)(X^{(i)}_{\cdot,U})\|_F^2 - E_1\|(\hat L_I - L^\star)(X^{(1)}_{\cdot,U})\|_F^2 = 2\,\mathrm{tr}\big((\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X)^T\Delta_I\hat\mu^{(I)}_X\big) + \mathrm{tr}\big(\Delta_I^T(\hat\Sigma^{(I)} - \Sigma)\Delta_I\big), \]
with $\hat\Sigma^{(I)} = (K/n)\sum_{i\in I}(X^{(i)}_{\cdot,U} - E[X^{(1)}_{\cdot,U}])^{\otimes 2}$. All this together leads to
\begin{align*}
\delta_1 &= \frac{K}{np}\sum_{i\in I}\Big(\|\varepsilon^{(i)}_{\cdot,U}\|_F^2 - E[\|\varepsilon^{(1)}_{\cdot,U}\|_F^2]\Big) - \frac2p\,\mathrm{tr}\big((\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X)\hat\mu^{(I)}_\varepsilon\big) - \frac2p\,\mathrm{tr}\big(\Delta_I\hat\mu^{(I)}_{\varepsilon X}\big)\\
&\quad + \frac2p\,\mathrm{tr}\big((\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X)^T\Delta_I\hat\mu^{(I)}_X\big) + \frac1p\,\mathrm{tr}\big(\Delta_I^T(\hat\Sigma^{(I)} - \Sigma)\Delta_I\big). \tag{12}
\end{align*}
The conclusion of the proof will come by invoking the following results, which will be proven later on.


Proposition 7. Under the assumptions of Proposition 5, provided that $\lambda = C\sqrt{n\sigma^2\log(p)}$ for some constant $C > 0$, we have
\[ \max_{k=1,\dots,p}\|\hat A_{I,k} - A^\star_k\|_1 = O_P\Big(\sqrt{\frac{\log(p)}{n}}\Big), \]
where $\hat A_{I,k}$ is the $k$-th line of $\hat A_I$.

Proposition 8. Under the assumptions of Proposition 5, we have
\[ \max\big(\|\hat\mu^{(I)}_Y\|_\infty,\ \|\hat\mu^{(I)}_X\|_\infty,\ \|\hat\mu^{(I)}_\varepsilon\|_\infty,\ \|\hat\mu^{(I)}_{\varepsilon X}\|_\infty\big) = O_P\Big(\sqrt{\frac{\log(p)}{n}}\Big). \]

Write
\[ \Big|\frac{K}{np}\sum_{i\in I}\|\varepsilon^{(i)}_{\cdot,U}\|_F^2 - p^{-1}E[\|\varepsilon^{(1)}_{\cdot,U}\|_F^2]\Big| = \Big|\frac1p\sum_{k=1}^p\xi_k\Big| \le \max_{k=1,\dots,p}|\xi_k|, \]
with $\xi_k = (K/n)\sum_{i\in I}\|\varepsilon^{(i)}_{k,U}\|_F^2 - E[\|\varepsilon^{(1)}_{k,U}\|_F^2]$. In virtue of the union bound, the previous term is then $O_P(\sqrt{\log(p)/n})$. The remaining terms are bounded as follows, using that $|\mathrm{tr}(AB)| \le \|A\|_\infty\|B\|_1$. We have that
\begin{align*}
\big|\mathrm{tr}\big((\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X)\hat\mu^{(I)}_\varepsilon\big)\big| &\le \|\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X\|_\infty\,\|\hat\mu^{(I)}_\varepsilon\|_1\\
&\le \Big(\|\hat\mu^{(I)}_Y\|_\infty + \max_{k=1,\dots,p}\|\hat A_{I,k}\|_1\,\|\hat\mu^{(I)}_X\|_\infty\Big)\,pT\,\|\hat\mu^{(I)}_\varepsilon\|_\infty\\
&= O_P\Big(p\,\frac{\log(p)}{n}\Big).
\end{align*}
Second, we have
\[ \big|\mathrm{tr}\big(\Delta_I\hat\mu^{(I)}_{\varepsilon X}\big)\big| \le \|\Delta_I\|_1\,\|\hat\mu^{(I)}_{\varepsilon X}\|_\infty = O_P\Big(p\,\frac{\log(p)}{n}\Big). \]
Third, using that $\|\hat\mu^{(I)}_X\hat\mu^{(I)T}_Y\|_\infty \le T\|\hat\mu^{(I)}_X\|_\infty\|\hat\mu^{(I)}_Y\|_\infty$, it holds that
\begin{align*}
\big|\mathrm{tr}\big((\hat\mu^{(I)}_Y - \hat A_I\hat\mu^{(I)}_X)^T\Delta_I\hat\mu^{(I)}_X\big)\big| &\le \big|\mathrm{tr}\big(\hat\mu^{(I)T}_Y\Delta_I\hat\mu^{(I)}_X\big)\big| + \big|\mathrm{tr}\big(\hat\mu^{(I)T}_X\hat A_I^T\Delta_I\hat\mu^{(I)}_X\big)\big|\\
&= \big|\mathrm{tr}\big(\hat\mu^{(I)}_X\hat\mu^{(I)T}_Y\Delta_I\big)\big| + \big|\mathrm{tr}\big(\hat\mu^{(I)}_X\hat\mu^{(I)T}_X\hat A_I^T\Delta_I\big)\big|\\
&\le \big(\|\hat\mu^{(I)}_X\hat\mu^{(I)T}_Y\|_\infty + \|\hat\mu^{(I)}_X\hat\mu^{(I)T}_X\hat A_I^T\|_\infty\big)\|\Delta_I\|_1\\
&\le \big(\|\hat\mu^{(I)}_X\hat\mu^{(I)T}_Y\|_\infty + \|\hat\mu^{(I)}_X\hat\mu^{(I)T}_X\|_\infty\max_{k=1,\dots,p}\|\hat A_{I,k}\|_1\big)\|\Delta_I\|_1\\
&\le T\big(\|\hat\mu^{(I)}_X\|_\infty\|\hat\mu^{(I)}_Y\|_\infty + \|\hat\mu^{(I)}_X\|_\infty^2\max_{k=1,\dots,p}\|\hat A_{I,k}\|_1\big)\|\Delta_I\|_1\\
&= O_P\Big(p\Big(\frac{\log(p)}{n}\Big)^{3/2}\Big).
\end{align*}
Fourth, it follows that
\[ \big|\mathrm{tr}\big(\Delta_I^T(\hat\Sigma^{(I)} - \Sigma)\Delta_I\big)\big| \le \|\hat\Sigma^{(I)} - \Sigma\|_\infty\,p\max_{k=1,\dots,p}\|\Delta_{I,k}\|_1^2 = O_P\Big(p\Big(\frac{\log(p)}{n}\Big)^{3/2}\Big). \]

The previous bounds can be used in (12) to prove the result.


Proof of Proposition 7 Without loss of generality, the proof is done for $\hat A^{(lasso)}_n$ instead of $\hat A_I$ (the only difference being the sample size, $n$ instead of $n/K$). We build upon the proof of Proposition 5. Injecting the inequality (7) into the right-hand side of (10), we get
\[ \|\Delta_{k,S^\star_k}\|_1 \le \sqrt{\frac{6|S^\star_k|\lambda}{n\gamma^\star}\,\|\Delta_{k,S^\star_k}\|_1}, \]
so that $\|\Delta_{k,S^\star_k}\|_1 \le 6|S^\star_k|\lambda/(n\gamma^\star)$. Because $\Delta_{k,\cdot} \in C(3, S^\star_k)$, it follows that
\[ \|\Delta_{k,\cdot}\|_1 \le 4\|\Delta_{k,S^\star_k}\|_1 \le 24\,\frac{|S^\star_k|\lambda}{n\gamma^\star} = C'\sqrt{\frac{\sigma^2\log(p)}{n}}, \]
for some $C' > 0$ that does not depend on $(n, p)$, with probability at least $1 - \delta$.

Proof of Proposition 8 The proof follows from the application of Lemmas 9 and 12.

A.4 Proof of Corollary 1

Introduce the notation $R_t = R(L^\star_t)$. Since $\hat t = t^\star$ if $\hat R_{t^\star} < \hat R_t$ for all $t \neq t^\star$, it suffices to show that $P(\hat R_{t^\star} - \min_{t\neq t^\star}\hat R_t < 0) \to 1$. We have that
\[ \hat R_{t^\star} - \min_{t\neq t^\star}\hat R_t \le R_{t^\star} - \min_{t\neq t^\star}R_t + \epsilon_n, \]
with $\epsilon_n = 2\max_{t=1,\dots,T}|\hat R_t - R_t|$, and from Proposition 6, $\epsilon_n = o_P(1)$. But for $p$ large enough, $R_{t^\star} - \min_{t\neq t^\star}R_t < -\epsilon < 0$. As a consequence, $P(\hat R_{t^\star} - \min_{t\neq t^\star}\hat R_t < 0) \ge P(\epsilon_n < \epsilon) \to 1$.

Appendix B Intermediate results

Define the standardized predictors $Z = \Sigma^{-1/2}(X - E(X))$ and, for any $i = 1,\dots,n$, $Z^{(i)} = \Sigma^{-1/2}(X^{(i)} - E(X))$.

Lemma 9. Suppose that Assumptions 1, 2, 3 and 4 are fulfilled. It holds that
\[ \|\bar Y_n - E(Y)\|_F = O_P\big(\sqrt{p/n}\big). \]
Moreover, for any $A \in \mathbb R^{q\times p}$, it holds that
\[ \|A\bar Z_n\|_F = O_P\big(\sqrt{\|A\|_F^2/n}\big). \]

Proof. If the $M_i$ are i.i.d. centered random matrices, we have that $E(\|\sum_{i=1}^nM_i\|_F^2) = nE(\|M_1\|_F^2)$. It follows that
\[ E(\|\bar Y_n - E(Y)\|_F^2) = n^{-1}E[\|Y - E(Y)\|_F^2] \le n^{-1}Tp\max_{t\in\mathcal T,\,k\in S}\mathrm{Var}(Y_{k,t}). \]
The conclusion comes from using Chebyshev's inequality and Assumption 4. For the second assertion, because $E[ZZ^T] = I_p$, we have
\[ E(\|A\bar Z_n\|_F^2) = n^{-1}\mathrm{tr}(AE[ZZ^T]A^T) = n^{-1}\|A\|_F^2. \]
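As a side remark (ours, not part of the original argument), the first rate of Lemma 9 is easy to visualize by simulation; a minimal sketch with toy sizes and standard Gaussian entries:

```python
# The Frobenius error of the empirical mean scales like sqrt(pT/n):
# the printed ratio stays roughly constant as n grows.
import numpy as np

rng = np.random.default_rng(2)
p, T = 50, 10
for n in [100, 400, 1600]:
    Y = rng.normal(size=(n, p, T))          # E(Y) = 0, unit variance
    err = np.linalg.norm(Y.mean(axis=0))    # ||Y_bar_n - E(Y)||_F
    print(n, err, err / np.sqrt(p * T / n))
```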

Lemma 10. Suppose that Assumptions 1, 2, 3 and 4 are fulfilled. We have
\[ \Big\|\sum_{i=1}^n\varepsilon^{(i)}\Big\|_F = O_P\big(\sqrt{n\sigma^2p}\big) \quad\text{and}\quad \Big\|\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)(Z^{(i)} - \bar Z_n)^T\Big\|_F = O_P\big(\sqrt{n\sigma^2p^2}\big). \]


Proof. The first statement follows from
\[ E\Big[\Big\|\sum_{i=1}^n\varepsilon^{(i)}\Big\|_F^2\Big] = nE\|\varepsilon\|_F^2 = n\sum_{t=1}^T\sum_{k=1}^pE[\varepsilon_{k,t}^2] \le nTp\sigma^2. \]
For the second statement, start by noting that
\[ \sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)(Z^{(i)} - \bar Z_n)^T = \sum_{i=1}^n\varepsilon^{(i)}Z^{(i)T} - n\bar\varepsilon_n\bar Z_n^T. \tag{13} \]
Then use the triangle inequality to get
\begin{align*}
\Big\|\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)(Z^{(i)} - \bar Z_n)^T\Big\|_F &\le \Big\|\sum_{i=1}^n\varepsilon^{(i)}Z^{(i)T}\Big\|_F + n\|\bar\varepsilon_n\bar Z_n^T\|_F\\
&\le \Big\|\sum_{i=1}^n\varepsilon^{(i)}Z^{(i)T}\Big\|_F + \|\bar Z_n\|_F\Big\|\sum_{i=1}^n\varepsilon^{(i)}\Big\|_F.
\end{align*}
Using Lemma 9 and the first statement, we find that $\|\bar Z_n\|_F\|\sum_{i=1}^n\varepsilon^{(i)}\|_F = O_P(\sigma p)$. Hence, showing that
\[ \Big\|\sum_{i=1}^n\varepsilon^{(i)}Z^{(i)T}\Big\|_F = O_P(\sqrt n\,\sigma p) \]
will conclude the proof. Using Assumption 2 and Proposition 1, we find that $E[\mathrm{tr}(\varepsilon^{(i)}Z^{(i)T}Z^{(j)}\varepsilon^{(j)T})] = 0$ for all $i \neq j$. It follows that
\[ E\Big[\Big\|\sum_{i=1}^n\varepsilon^{(i)}Z^{(i)T}\Big\|_F^2\Big] = nE[\|\varepsilon^{(1)}Z^{(1)T}\|_F^2]. \]
Using the triangle inequality and the Jensen inequality, we get
\begin{align*}
E\Big[\Big\|\sum_{i=1}^n\varepsilon^{(i)}Z^{(i)T}\Big\|_F^2\Big] &\le nE\Big(\sum_{t=1}^T\|\varepsilon^{(1)}_{\cdot,t}Z^{(1)T}_{\cdot,t}\|_F\Big)^2 \le nT\sum_{t=1}^TE[\|\varepsilon^{(1)}_{\cdot,t}Z^{(1)T}_{\cdot,t}\|_F^2]\\
&= nT\sum_{t=1}^TE[\mathrm{tr}(\varepsilon^{(1)}_{\cdot,t}\|Z^{(1)}_{\cdot,t}\|_2^2\varepsilon^{(1)T}_{\cdot,t})] = nT\sum_{t=1}^TE[\|Z^{(1)}_{\cdot,t}\|_2^2\|\varepsilon^{(1)}_{\cdot,t}\|_2^2]\\
&\le nTp\sigma^2\sum_{t=1}^TE[\|Z^{(1)}_{\cdot,t}\|_2^2] = nTp^2\sigma^2.
\end{align*}
The last inequality is a consequence of the definition of $\sigma^2$.

Recall that
\[ \hat\Sigma = n^{-1}\sum_{i=1}^n(X^{(i)} - \bar X_n)(X^{(i)} - \bar X_n)^T \quad\text{and}\quad \hat\Pi = \Sigma^{-1/2}\hat\Sigma\Sigma^{-1/2}. \]


Lemma 11. Suppose that Assumptions 1 and 2 are fulfilled and that, almost surely,
\[ B \ge \mathrm{tr}\big((X - E(X))^T\Sigma^{-1}(X - E(X))\big). \]
Then, for any $(n, \delta)$ such that $n \ge 4B\log(4pT/\delta)$, it holds, with probability $1 - \delta$:
\[ \|\hat\Pi - I_p\| \le 4\sqrt{n^{-1}B\log(4pT/\delta)}. \]

Proof. Remark that
\[ \hat\Pi = \tilde\Pi - \bar Z_n\bar Z_n^T, \qquad\text{with } \tilde\Pi = n^{-1}\sum_{i=1}^nZ^{(i)}Z^{(i)T}. \]
Apply the triangle inequality to get
\[ \|\hat\Pi - I_p\| \le \|\tilde\Pi - I_p\| + \|\bar Z_n\bar Z_n^T\| = \|\tilde\Pi - I_p\| + \|\bar Z_n\|^2. \]
We shall apply Lemma 15 to control each of the two terms in the previous upper bound. First consider the case where $S^{(i)} = (Z^{(i)}Z^{(i)T} - I_p)/n$. We need to specify the values of $L$ and $v$ that we can use. Note that
\[ \|Z^{(i)}Z^{(i)T}\| = \Big\|\sum_{t=1}^TZ^{(i)}_{\cdot,t}Z^{(i)T}_{\cdot,t}\Big\| \le \sum_{t=1}^T\|Z^{(i)}_{\cdot,t}Z^{(i)T}_{\cdot,t}\| \le \sum_{t=1}^T\|Z^{(i)}_{\cdot,t}\|_2^2 \le B. \tag{14} \]
Using the triangle inequality and the Jensen inequality, we have, for any $i = 1,\dots,n$,
\[ \|S^{(i)}\| \le n^{-1}\big(\|Z^{(i)}Z^{(i)T}\| + E[\|ZZ^T\|]\big) \le 2Bn^{-1}. \]
Consequently, we can take $L = 2Bn^{-1}$. Moreover, we have
\begin{align*}
\Big\|\sum_{i=1}^nE[(Z^{(i)}Z^{(i)T} - I_p)^T(Z^{(i)}Z^{(i)T} - I_p)]\Big\| &= \Big\|\sum_{i=1}^n\big(E[Z^{(i)}Z^{(i)T}Z^{(i)}Z^{(i)T}] - I_p\big)\Big\|\\
&= n\|E[ZZ^TZZ^T] - I_p\| \le n\|E[ZZ^TZZ^T]\|\\
&= n\sup_{\|u\|=1}E[u^TZZ^TZZ^Tu] = n\sup_{\|u\|=1}E[\|ZZ^Tu\|_2^2].
\end{align*}
From the classic inequality $\|A^{1/2}u\|_2^2 \le \|A\|\|u\|_2^2$, we deduce that $\|Au\|_2^2 \le \|A\|\|A^{1/2}u\|_2^2$. Applying this with $A = ZZ^T$ and using (14), we obtain
\[ \|ZZ^Tu\|_2^2 \le \|ZZ^T\|\|(ZZ^T)^{1/2}u\|_2^2 \le B\|(ZZ^T)^{1/2}u\|_2^2. \]
Taking the expectation in the previous inequality, we get $E[\|ZZ^Tu\|_2^2] \le B\|u\|_2^2$. Hence
\[ \Big\|\sum_{i=1}^nE[(Z^{(i)}Z^{(i)T} - I_p)^T(Z^{(i)}Z^{(i)T} - I_p)]\Big\| \le nB, \]
and we can take $v = n^{-1}B$. Lemma 15 gives that
\[ P\big(\|\tilde\Pi - I_p\| \ge t\big) \le 2p\exp\Big(\frac{-t^2}{2n^{-1}B(1 + 2t/3)}\Big). \]
Consequently, with probability $1 - \delta/2$, it holds that
\[ \|\tilde\Pi - I_p\| \le \sqrt{4n^{-1}B\log(4p/\delta)}, \]


provided that $4n^{-1}B\log(4p/\delta) \le 9/4$. Now we apply Lemma 15 with $S^{(i)} = Z^{(i)}/n$ to provide a bound on $\|\bar Z_n\|$. By (14), we have that
\[ \|Z\| = \|ZZ^T\|^{1/2} \le B^{1/2} \quad\text{and}\quad \max\big(\|E[ZZ^T]\|, \|E[Z^TZ]\|\big) = \max(\|I_p\|, p) = p \le B. \]
Using $p + T \le 2pT$, it follows that
\[ P\big(\|\bar Z_n\| \ge t\big) \le 2pT\exp\Big(\frac{-t^2}{2(Bn^{-1} + tB^{1/2}n^{-1}/3)}\Big). \]
By taking $t = \sqrt{4n^{-1}B\log(4pT/\delta)}$, we get that $\|\bar Z_n\| \ge t$ with probability smaller than
\[ 2pT\exp\Big(\frac{-2\log(4pT/\delta)}{1 + \sqrt{4n^{-1}\log(4pT/\delta)}/3}\Big) \le \delta/2, \]
provided that $4n^{-1}\log(4pT/\delta) \le 9$. Hence we have shown that, with probability $1 - \delta$,
\begin{align*}
\|\tilde\Pi - I_p\| + \|\bar Z_n\|^2 &\le \sqrt{4n^{-1}B\log(4p/\delta)} + 4n^{-1}B\log(4pT/\delta)\\
&\le \sqrt{4n^{-1}B\log(4pT/\delta)}\Big(1 + \sqrt{4n^{-1}B\log(4pT/\delta)}\Big).
\end{align*}
Use that $4n^{-1}B\log(4pT/\delta) \le 1$ to conclude.

Lemma 12. Suppose that Assumptions 1 and 2 are fulfilled. Let $U$ be such that $\max_{k\in S,\,t\in\mathcal T}|X_{k,t} - E(X_{k,t})| \le U$ and $\max_{k\in S,\,t\in\mathcal T}|\varepsilon_{k,t}| \le U$, almost surely. If $T \le p$ and $n \ge 4(U^2/\sigma^2)\log(6p^2/\delta)$, we have with probability $1 - \delta$:
\[ \Big\|\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)(X^{(i)} - \bar X_n)^T\Big\|_\infty \le 4\sqrt{T^2U^2\sigma^2n\log(6p^2/\delta)}. \]

Proof. We have
\begin{align*}
\Big\|\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)(X^{(i)} - \bar X_n)^T\Big\|_\infty &= \Big\|\sum_{i=1}^n\varepsilon^{(i)}(X^{(i)} - \bar X_n)^T\Big\|_\infty\\
&\le \Big\|\sum_{i=1}^n\varepsilon^{(i)}(X^{(i)} - E(X))^T\Big\|_\infty + n\big\|\bar\varepsilon_n(\bar X_n - E(X))^T\big\|_\infty\\
&= \Big\|\sum_{i=1}^n\varepsilon^{(i)}(X^{(i)} - E(X))^T\Big\|_\infty + n\Big\|\sum_{t=1}^T\bar\varepsilon^n_{\cdot,t}(\bar X^n_{\cdot,t} - E(X_{\cdot,t}))^T\Big\|_\infty\\
&\le \Big\|\sum_{i=1}^n\varepsilon^{(i)}(X^{(i)} - E(X))^T\Big\|_\infty + nT\max_{t=1,\dots,T}\big\|\bar\varepsilon^n_{\cdot,t}(\bar X^n_{\cdot,t} - E(X_{\cdot,t}))^T\big\|_\infty\\
&\le \Big\|\sum_{i=1}^n\varepsilon^{(i)}(X^{(i)} - E(X))^T\Big\|_\infty + nT\,\|\bar\varepsilon_n\|_\infty\big\|\bar X_n - E(X)\big\|_\infty.
\end{align*}


For the first term, we apply Lemma 14 with $A^{(i)} = \varepsilon^{(i)}(X^{(i)} - E(X))^T$, noting that, by the Jensen inequality,
\begin{align*}
\mathrm{Var}\big((\varepsilon^{(1)}(X^{(1)} - E(X))^T)_{k,\ell}\big) &= E\big[(\varepsilon^{(1)}(X^{(1)} - E(X))^T)_{k,\ell}^2\big] = E\Big[\Big(\sum_{t=1}^T\varepsilon_{k,t}(X_{\ell,t} - EX_{\ell,t})\Big)^2\Big]\\
&\le T\sum_{t=1}^TE\big((\varepsilon_{k,t}(X_{\ell,t} - EX_{\ell,t}))^2\big) \le T^2\sigma^2\max_{\ell\in S,\,t\in\mathcal T}E\big((X_{\ell,t} - EX_{\ell,t})^2\big) \le T^2U^2\sigma^2,
\end{align*}
so that we may take $v = T^2U^2\sigma^2$ and $c = TU^2$. Then, using that $9T^2U^2\sigma^2n \ge 4T^2U^4\log(2p^2/(\delta/3))$, we have with probability $1 - \delta/3$,
\[ \Big\|\sum_{i=1}^n\varepsilon^{(i)}(X^{(i)} - E(X))^T\Big\|_\infty \le \sqrt{4T^2U^2\sigma^2n\log(6p^2/\delta)}. \]
Applying again Lemma 14, to the sequence $(\varepsilon^{(i)})_{i=1,\dots,n}$, taking $v = \sigma^2$ and $c = U$ and using $9\sigma^2n \ge 4U^2\log(2p^2/(\delta/3)) \ge 4U^2\log(2pT/(\delta/3))$, we deduce that with probability $1 - \delta/3$,
\[ \|\bar\varepsilon_n\|_\infty \le \sqrt{4\sigma^2\log(6pT/\delta)/n}. \]
Applying again Lemma 14, to the sequence $(X^{(i)} - EX)_{i=1,\dots,n}$, taking $v = U^2$ and $c = U$ and using $9n \ge 4(U^2/\sigma^2)\log(2p^2/(\delta/3)) \ge 4\log(2pT/(\delta/3))$, we deduce that with probability $1 - \delta/3$,
\[ \|\bar X_n - E(X)\|_\infty \le \sqrt{4U^2\log(6pT/\delta)/n}. \]
All this together with the fact that $T \le p$, we get
\begin{align*}
\Big\|\sum_{i=1}^n(\varepsilon^{(i)} - \bar\varepsilon_n)(X^{(i)} - \bar X_n)^T\Big\|_\infty &\le \sqrt{4T^2U^2\sigma^2n\log(6p^2/\delta)} + T\sqrt{4\sigma^2\log(6p^2/\delta)\cdot 4U^2\log(6p^2/\delta)}\\
&= \sqrt{4T^2U^2\sigma^2n\log(6p^2/\delta)}\Big(1 + \sqrt{4\log(6p^2/\delta)/n}\Big),
\end{align*}
and the conclusion follows from $n \ge 4(U^2/\sigma^2)\log(6p^2/\delta) \ge 4\log(6p^2/\delta)$.

Lemma 13. Suppose that Assumptions 1, 2 and 3 are fulfilled. If $T \le p$ and $n \ge \log(4p^2/\delta)$, we have with probability $1 - \delta$:
\[ \|\hat\Sigma - \Sigma\|_\infty \le 8TU^2\sqrt{\log(4p^2/\delta)/n}, \]
where $U$ is such that $\max_{k\in S,\,t\in\mathcal T}|X_{k,t} - E(X_{k,t})| \le U$ almost surely.

Proof. We have
\[ \hat\Sigma - \Sigma = n^{-1}\sum_{i=1}^n\big((X^{(i)} - E(X))(X^{(i)} - E(X))^T - \Sigma\big) - (E(X) - \bar X_n)(E(X) - \bar X_n)^T. \]
In the following, we bound each of the two previous terms under events of probability $1 - \delta/2$, respectively. Apply Lemma 14 with $A^{(i)} = (X^{(i)} - E(X))(X^{(i)} - E(X))^T - \Sigma$. Note that
\[ |A^{(1)}_{k,\ell}| \le \big|(X^{(1)}_{k,\cdot} - E(X_{k,\cdot}))^T(X^{(1)}_{\ell,\cdot} - E(X_{\ell,\cdot}))\big| + |\Sigma_{k,\ell}| \le 2TU^2, \]
since $|(X^{(1)}_{k,\cdot} - E(X_{k,\cdot}))^T(X^{(1)}_{\ell,\cdot} - E(X_{\ell,\cdot}))| = |\sum_{t=1}^T(X^{(1)}_{k,t} - E(X_{k,t}))(X^{(1)}_{\ell,t} - E(X_{\ell,t}))| \le TU^2$,


and
\[ \mathrm{Var}(A^{(1)}_{k,\ell}) \le E\Big[\big((X^{(1)}_{k,\cdot} - E(X_{k,\cdot}))^T(X^{(1)}_{\ell,\cdot} - E(X_{\ell,\cdot}))\big)^2\Big] \le (2TU^2)^2. \]
Hence we apply Lemma 14 with $v = (2TU^2)^2$ and $c = 2TU^2$. We obtain, because $9n \ge 4\log(4p^2/\delta)$, with probability $1 - \delta/2$,
\[ \Big\|n^{-1}\sum_{i=1}^n(X^{(i)} - E(X))(X^{(i)} - E(X))^T - \Sigma\Big\|_\infty \le 4TU^2\sqrt{\log(4p^2/\delta)/n}. \]
Note that
\[ \big((E(X) - \bar X_n)(E(X) - \bar X_n)^T\big)_{k,\ell} = \sum_{t=1}^T(E(X_{k,t}) - \bar X^n_{k,t})(E(X_{\ell,t}) - \bar X^n_{\ell,t}). \]
It follows that
\[ \big|\big((E(X) - \bar X_n)(E(X) - \bar X_n)^T\big)_{k,\ell}\big| \le T\max_{t=1,\dots,T}\max_{k=1,\dots,p}|E(X_{k,t}) - \bar X^n_{k,t}|^2. \]
Applying Lemma 14 with $A^{(i)} = X^{(i)} - E(X)$, $c = U$ and $v = U^2$, we get, because $9n \ge 4\log(2pT/(\delta/2))$, with probability $1 - \delta/2$,
\[ \max_{t=1,\dots,T}\max_{k=1,\dots,p}|E(X_{k,t}) - \bar X^n_{k,t}| \le \sqrt{4U^2\log(4pT/\delta)/n}. \]
Hence we have shown that, with probability $1 - \delta$,
\begin{align*}
\|\hat\Sigma - \Sigma\|_\infty &\le 4TU^2\sqrt{\log(4p^2/\delta)/n} + 4TU^2\log(4pT/\delta)/n\\
&\le 4TU^2\sqrt{\log(4p^2/\delta)/n}\Big(1 + \sqrt{\log(4p^2/\delta)/n}\Big).
\end{align*}
Invoking that $n \ge \log(4p^2/\delta)$, we obtain the result.

Appendix C Auxiliary results

Lemma 14 (Bernstein inequality). Suppose that $(A^{(i)})_{i\ge1}$ is an i.i.d. sequence of centered random matrices valued in $\mathbb R^{p\times T}$. Suppose that $\|A^{(1)}\|_\infty \le c$ a.s. and that $\mathrm{Var}(A^{(1)}_{k,\ell}) \le v$. For any $\delta \in (0,1)$ and $n \ge 1$ such that $9vn \ge 4c^2\log(2pT/\delta)$, we have with probability $1 - \delta$:
\[ \Big\|\sum_{i=1}^nA^{(i)}\Big\|_\infty \le \sqrt{4nv\log(2pT/\delta)}. \]

Proof. By the Bernstein inequality, it follows that
\[ P\Big(\Big|\sum_{i=1}^nA^{(i)}_{k,t}\Big| > x\Big) \le 2\exp\Big(-\frac{x^2}{2(vn + cx/3)}\Big). \]
Choosing $x = \sqrt{4nv\log(2pT/\delta)}$ and because $vn \ge cx/3$, we get that
\[ P\Big(\Big|\sum_{i=1}^nA^{(i)}_{k,t}\Big| > x\Big) \le \delta/(pT). \]
Use the union bound to get that
\[ P\Big(\Big\|\sum_{i=1}^nA^{(i)}\Big\|_\infty > x\Big) \le \sum_{1\le k\le p}\sum_{1\le t\le T}P\Big(\Big|\sum_{i=1}^nA^{(i)}_{k,t}\Big| > x\Big) \le \delta. \]
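To illustrate the lemma (this check is ours, not part of the paper), a small Monte Carlo experiment confirms that the stated threshold is rarely exceeded:

```python
# Empirical check of Lemma 14 with bounded centred (Rademacher) entries:
# v = c = 1, and the exceedance frequency should be at most delta.
import numpy as np

rng = np.random.default_rng(1)
n, p, T, delta, reps = 2000, 20, 10, 0.05, 500
threshold = np.sqrt(4 * n * 1.0 * np.log(2 * p * T / delta))
exceed = 0
for _ in range(reps):
    A = rng.choice([-1.0, 1.0], size=(n, p, T))   # i.i.d., centred, |entries| <= 1
    exceed += np.abs(A.sum(axis=0)).max() > threshold
print(exceed / reps, "<=", delta)
```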


Lemma 15 (Matrix Bernstein inequality). Let $S^{(1)},\dots,S^{(n)}$ be independent, centered random matrices with common dimension $d_1\times d_2$, and assume that each one is uniformly bounded in spectral norm, i.e.,
\[ E[S^{(i)}] = 0 \quad\text{and}\quad \|S^{(i)}\| \le L \quad\text{for each } i = 1,\dots,n. \]
Let $v > 0$ be such that
\[ \max\Big(\Big\|\sum_{i=1}^nE\big(S^{(i)}S^{(i)T}\big)\Big\|,\ \Big\|\sum_{i=1}^nE\big(S^{(i)T}S^{(i)}\big)\Big\|\Big) \le v. \]
Then, for all $t \ge 0$,
\[ P\Big(\Big\|\sum_{i=1}^nS^{(i)}\Big\| \ge t\Big) \le (d_1 + d_2)\exp\Big(\frac{-t^2/2}{v + Lt/3}\Big). \]
