


Minimax rates of estimation for high-dimensional linear regression over ℓq-balls

Garvesh Raskutti, Martin J. Wainwright, Senior Member, IEEE, and Bin Yu, Fellow, IEEE.

To appear in IEEE Transactions on Information Theory. Initial version: arXiv:0910.2042

This version June 2011.

Abstract—Consider the high-dimensional linear regression model y = Xβ∗ + w, where y ∈ Rn is an observation vector, X ∈ Rn×d is a design matrix with d > n, the quantity β∗ ∈ Rd is an unknown regression vector, and w ∼ N(0, σ²I) is additive Gaussian noise. This paper studies the minimax rates of convergence for estimating β∗ in either ℓ2-loss or ℓ2-prediction loss, assuming that β∗ belongs to an ℓq-ball Bq(Rq) for some q ∈ [0, 1]. It is shown that under suitable regularity conditions on the design matrix X, the minimax optimal rate in ℓ2-loss and ℓ2-prediction loss scales as Rq (log d / n)^{1−q/2}. The analysis in this paper reveals that conditions on the design matrix X enter into the rates for ℓ2-error and ℓ2-prediction error in complementary ways in the upper and lower bounds. Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results on the metric entropy of the balls Bq(Rq), whereas our proofs of the upper bounds are constructive, involving direct analysis of least-squares over ℓq-balls. For the special case q = 0, corresponding to models with an exact sparsity constraint, our results show that although computationally efficient ℓ1-based methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix X than optimal algorithms involving least-squares over the ℓ0-ball.

Keywords: compressed sensing; high-dimensional statistics; minimax rates; Lasso; sparse linear regression; ℓ1-relaxation.¹

G. Raskutti is with the Department of Statistics, University of California, Berkeley, CA, 94720 USA. e-mail: [email protected]. M. J. Wainwright and B. Yu are with the Department of Statistics and Department of Electrical Engineering and Computer Science, University of California, Berkeley. e-mail: [email protected] and [email protected].

¹ This work was partially supported by NSF grants DMS-0605165 and DMS-0907362 to MJW and BY. In addition, BY was partially supported by the NSF grant SES-0835531 (CDI), the NSFC grant 60628102 and a grant from the MSRA. In addition, MJW was supported by a Sloan Foundation Fellowship and AFOSR Grant FA9550-09-1-0466. During this work, GR was financially supported by a Berkeley Graduate Fellowship.

The initial version of this work was posted as arXiv:0910.2042 in September 2009. In addition, it was presented in part at the Allerton Conference on Control, Communication and Computing in October 2009.

I. INTRODUCTION

The area of high-dimensional statistical inference concerns estimation in the d > n regime, where d refers to the ambient dimension of the problem and n refers to the sample size. Such high-dimensional inference problems arise in various areas of science, including astrophysics, remote sensing and geophysics, and computational biology, among others. In the general setting, it is impossible to obtain consistent estimators unless the ratio d/n converges to zero. However, many applications require solving inference problems with d > n, in which setting the only hope is that the data has some type of lower-dimensional structure. Accordingly, an active line of research in high-dimensional inference is based on imposing various types of structural constraints, including sparsity, manifold structure, or Markov conditions, and then studying the performance of different estimators. For instance, in the case of models with some type of sparsity constraint, a great deal of work has studied the behavior of ℓ1-based relaxations.

Complementary to the understanding of computationally efficient procedures are the fundamental or information-theoretic limitations of statistical inference, applicable to any algorithm regardless of its computational cost. There is a rich line of statistical work on such fundamental limits, an understanding of which can have two types of consequences. First, they can reveal gaps between the performance of optimal algorithms and that of known computationally efficient methods. Second, they can demonstrate regimes in which practical algorithms achieve the fundamental limits, which means that there is little point in searching for a more effective algorithm. As we shall see, the results in this paper lead to an understanding of both types.

A. Problem set-up

The focus of this paper is a canonical instance of a high-dimensional inference problem, namely that of estimating a high-dimensional regression vector β∗ ∈ Rd with sparsity constraints based on observations from a linear model. In this problem, we observe a pair (y, X) ∈ Rn × Rn×d, where X is the design matrix and y is a vector of response variables. These quantities are linked by the standard linear model

y = Xβ∗ + w, (1)

where w ∼ N(0, σ²In×n) is observation noise. The goal is to estimate the unknown vector β∗ ∈ Rd of regression coefficients. The sparse instance of this problem, in which the regression vector β∗ satisfies some type of sparsity constraint, has been investigated extensively over the past decade. A variety of practical algorithms have been proposed and studied, many based on ℓ1-regularization, including basis pursuit [11], the Lasso [11, 32], and the Dantzig selector [8]. Various authors have obtained convergence rates for different error metrics, including ℓ2-norm error [4, 8, 25, 41], prediction loss [4, 16, 34], as well as model selection consistency [24, 36, 41, 43]. In addition, a range of sparsity assumptions have been analyzed, including the case of hard sparsity, meaning that β∗ has exactly s ≪ d non-zero entries, and soft sparsity assumptions, based on imposing a certain decay rate on the ordered entries of β∗. Intuitively, soft sparsity means that while many of the coefficients of the covariates may be non-zero, many of the covariates make only a small overall contribution to the model, which may be more applicable in some practical settings.

a) Sparsity constraints: One way in which to capture the notion of sparsity in a precise manner is in terms of the ℓq-balls² for q ∈ [0, 1], defined as

Bq(Rq) := {β ∈ Rd : ‖β‖_q^q := ∑_{j=1}^{d} |βj|^q ≤ Rq}.

Note that in the limiting case q = 0, we have the ℓ0-ball

B0(s) := {β ∈ Rd : ∑_{j=1}^{d} I[βj ≠ 0] ≤ s},

which corresponds to the set of vectors β with at most s non-zero elements. For q ∈ (0, 1], membership of β in Bq(Rq) enforces a "soft" form of sparsity, in that all of the coefficients of β may be non-zero, but their absolute magnitudes must decay at a relatively rapid rate. This type of soft sparsity is appropriate for various applications of high-dimensional linear regression, including image denoising, medical reconstruction and database updating, in which exact sparsity is not realistic.

²Strictly speaking, these sets are not "balls" when q < 1, since they fail to be convex.
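As a concrete illustration of hard versus soft sparsity, the following small Python check (with made-up vectors and radii, purely for illustration) verifies membership in B0(s) and in Bq(Rq) for q < 1:

```python
import numpy as np

def in_lq_ball(beta, q, radius):
    """Membership in the lq-'ball' Bq(R) for q in (0, 1]: sum_j |beta_j|^q <= R."""
    return np.sum(np.abs(beta) ** q) <= radius

def in_l0_ball(beta, s):
    """Membership in B0(s): at most s non-zero entries (hard sparsity)."""
    return np.count_nonzero(beta) <= s

# Hard-sparse vector: exactly 2 non-zeros out of d = 6.
beta_hard = np.array([3.0, 0.0, 0.0, -1.5, 0.0, 0.0])
# Soft-sparse vector: every entry is non-zero, but magnitudes decay geometrically.
beta_soft = np.array([1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125])

print(in_l0_ball(beta_hard, s=2))                 # True
print(in_l0_ball(beta_soft, s=2))                 # False: no entry is exactly zero
print(in_lq_ball(beta_soft, q=0.5, radius=4.0))   # True: fast decay keeps the lq-cost small
```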

b) Loss functions: We consider estimators β̂ : Rn × Rn×d → Rd that are measurable functions of the data (y, X). Given any such estimator of the true parameter β∗, there are many criteria for determining the quality of the estimate. In a decision-theoretic framework, one introduces a loss function such that L(β̂, β∗) represents the loss incurred by estimating β̂ when β∗ ∈ Bq(Rq) is the true parameter. In the minimax formalism, one seeks to choose an estimator that minimizes the worst-case loss given by

min_{β̂} max_{β∗ ∈ Bq(Rq)} L(β̂, β∗). (2)

Note that the quantity (2) is random, since β̂ depends on the noise w, and therefore we must provide bounds that hold either with high probability or in expectation. In this paper, we provide results that hold with high probability, as shown in the statements of our main results in Theorems 1 through 4.

Moreover, various choices of the loss function are possible, including (i) the model selection loss, which is zero if and only if the support supp(β̂) of the estimate agrees with the true support supp(β∗), and one otherwise; (ii) the ℓ2-loss

L2(β̂, β∗) := ‖β̂ − β∗‖₂² = ∑_{j=1}^{d} |β̂j − β∗j|², (3)

and (iii) the ℓ2-prediction loss ‖X(β̂ − β∗)‖₂²/n. The information-theoretic limits of model selection have been studied extensively in past work (e.g., [1, 15, 37, 38]); in contrast, the analysis of this paper is focused on understanding the minimax rates associated with the ℓ2-loss and the ℓ2-prediction loss.

More precisely, the goal of this paper is to provide upper and lower bounds on the following four forms of minimax risk:

M2(Bq(Rq), X) := min_{β̂} max_{β∗ ∈ Bq(Rq)} ‖β̂ − β∗‖₂²,

M2(B0(s), X) := min_{β̂} max_{β∗ ∈ B0(s)} ‖β̂ − β∗‖₂²,

Mn(Bq(Rq), X) := min_{β̂} max_{β∗ ∈ Bq(Rq)} (1/n)‖X(β̂ − β∗)‖₂²,

Mn(B0(s), X) := min_{β̂} max_{β∗ ∈ B0(s)} (1/n)‖X(β̂ − β∗)‖₂².

These quantities correspond to all possible combinations of minimax risk involving either the ℓ2-loss or the ℓ2-prediction loss, and with either hard sparsity (q = 0) or soft sparsity (q ∈ (0, 1]).
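For concreteness, the two losses studied throughout the paper can be computed as follows; this is a minimal sketch with made-up inputs, and the estimator beta_hat below is purely hypothetical:

```python
import numpy as np

def l2_loss(beta_hat, beta_star):
    """Squared ell_2-loss ||beta_hat - beta_star||_2^2, as in (3)."""
    return float(np.sum((beta_hat - beta_star) ** 2))

def prediction_loss(X, beta_hat, beta_star):
    """ell_2-prediction loss ||X(beta_hat - beta_star)||_2^2 / n."""
    n = X.shape[0]
    return float(np.sum((X @ (beta_hat - beta_star)) ** 2) / n)

# Toy usage with a random design and a 3-sparse truth.
rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = [2.0, -1.0, 0.5]
beta_hat = beta_star + 0.1 * rng.standard_normal(d)   # hypothetical estimate
print(l2_loss(beta_hat, beta_star), prediction_loss(X, beta_hat, beta_star))
```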


B. Our contributions

The main contributions are derivations of optimal minimax rates both for ℓ2-norm and ℓ2-prediction losses, and perhaps more significantly, a thorough characterization of the conditions that are required on the design matrix X in each case. The core of the paper consists of four main theorems, corresponding to lower and upper bounds on the minimax rate for ℓ2-norm loss (Theorems 1 and 2, respectively) and lower and upper bounds for ℓ2-prediction loss (Theorems 3 and 4, respectively). We note that for the linear model (1), the special case of orthogonal design X = √n In×n (so that n = d necessarily holds) has been extensively studied in the statistics community (for example, see the papers [3, 6, 14] as well as references therein). In contrast, our emphasis is on the high-dimensional setting d > n, and our goal is to obtain results for general design matrices X.

More specifically, in Theorem 1, we provide lower bounds for the ℓ2-loss that involve a maximum of two quantities: a term involving the diameter of the null-space restricted to the ℓq-ball, measuring the degree of non-identifiability of the model, and a term arising from the ℓ2-metric entropy structure of ℓq-balls, measuring the complexity of the parameter space. Theorem 2 is complementary in nature, devoted to upper bounds obtained by direct analysis of a specific estimator. We obtain upper and lower bounds that match up to factors that are independent of the triple (n, d, Rq), but depend on constants related to the structure of the design matrix X (see Theorems 1 and 2). Finally, Theorems 3 and 4 are for ℓ2-prediction loss. For this loss, we provide upper and lower bounds on minimax rates that are again matching up to factors independent of (n, d, Rq), but dependent again on the conditions of the design matrix.

A key part of our analysis is devoted to understanding the link between the prediction semi-norm—more precisely, the quantity ‖Xθ‖₂/√n—and the ℓ2-norm ‖θ‖₂. In the high-dimensional setting (with X ∈ Rn×d and d ≫ n), these norms are in general incomparable, since the design matrix X has a null-space of dimension at least d − n. However, for analyzing sparse linear regression models, it is sufficient to study the approximate equivalence of these norms only for elements θ lying in the ℓq-ball, and this relationship between the two semi-norms plays an important role in the proofs of both the upper and the lower bounds. Indeed, for Gaussian noise models, the prediction semi-norm ‖X(β − β∗)‖₂/√n corresponds to the square-root Kullback-Leibler divergence between the distributions on y indexed by β and β∗, and so reflects the discriminability of these models. Our analysis shows that the conditions on X enter in quite a different manner for ℓ2-norm and prediction losses. In particular, for the case q > 0, proving upper bounds on ℓ2-norm error and lower bounds on prediction error requires relatively strong conditions on the design matrix X, whereas lower bounds on ℓ2-norm error and upper bounds on prediction error require only a very mild column normalization condition.

The proofs of the lower bounds in Theorems 1 and 3 involve a combination of standard information-theoretic techniques (e.g., [5, 18, 39]) with results from the approximation theory literature (e.g., [17, 21]) on the metric entropy of ℓq-balls. The proofs of the upper bounds in Theorems 2 and 4 involve direct analysis of the least-squares optimization over the ℓq-ball. The basic idea involves concentration results for Gaussian random variables and properties of the ℓ1-norm over ℓq-balls (see Lemma 5).

The remainder of this paper is organized as follows. In Section II, we state our main results and discuss their consequences. While we were writing up the results of this paper, we became aware of concurrent work by Zhang [42], and we provide a more detailed discussion and comparison in Section II-E, following the precise statement of our results. In addition, we also discuss a comparison between the conditions on X imposed in our work and related conditions imposed in the large body of work on ℓ1-relaxations. In Section III, we provide the proofs of our main results, with more technical aspects deferred to the appendices.

II. MAIN RESULTS AND THEIR CONSEQUENCES

This section is devoted to the statement of our main results, and discussion of some of their consequences. We begin by specifying the conditions on the high-dimensional scaling and the design matrix X that enter different parts of our analysis, before giving precise statements of our main results.

A. Assumptions on design matrices

Let X(i) denote the ith row of X and Xj denote the jth column of X. Our first assumption, which remains in force throughout most of our analysis, is that the columns Xj, j = 1, . . . , d, of the design matrix X are bounded in ℓ2-norm.

Assumption 1 (Column normalization). There exists a constant 0 < κc < +∞ such that

(1/√n) max_{j=1,...,d} ‖Xj‖₂ ≤ κc. (4)


This is a fairly mild condition, since the problem can always be normalized to ensure that it is satisfied. Moreover, it would be satisfied with high probability for any random design matrix for which (1/n)‖Xj‖₂² = (1/n) ∑_{i=1}^{n} X_{ij}² satisfies a sub-exponential tail bound. This column normalization condition is required for all the theorems except for the achievability bounds for ℓ2-prediction error when q = 0.
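In code, Assumption 1 amounts to a simple check; the sketch below just computes the smallest constant κc for which (4) holds for a given design:

```python
import numpy as np

def column_normalization_constant(X):
    """Smallest kappa_c satisfying (4): max_j ||X_j||_2 / sqrt(n)."""
    n = X.shape[0]
    return np.linalg.norm(X, axis=0).max() / np.sqrt(n)

# Example: i.i.d. N(0, 1) entries give column norms concentrating around sqrt(n),
# so the resulting constant is close to 1.
X = np.random.default_rng(0).standard_normal((1000, 200))
print(column_normalization_constant(X))
```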

We now turn to a more subtle condition on the design matrix X:

Assumption 2 (Bound on restricted lower eigenvalue). For q ∈ (0, 1], there exists a constant κℓ > 0 and a function fℓ(Rq, n, d) such that

‖Xθ‖₂/√n ≥ κℓ (‖θ‖₂ − fℓ(Rq, n, d)) (5)

for all θ ∈ Bq(2Rq).

A few comments on this assumption are in order. For the case q > 0, this assumption is imposed when deriving upper bounds for the ℓ2-error and lower bounds for the ℓ2-prediction error. It is required in upper bounding the ℓ2-error because for any two distinct vectors β, β′ ∈ Bq(Rq), the prediction semi-norm ‖X(β − β′)‖₂/√n is closely related to the Kullback-Leibler divergence, which quantifies how distinguishable β is from β′ in terms of the linear regression model. Indeed, note that for fixed X and β, the vector Y ∼ N(Xβ, σ²In×n), so that the Kullback-Leibler divergence between the distributions on Y indexed by β and β′ is given by (1/(2σ²))‖X(β − β′)‖₂². Thus, the lower bound (5), when applied to the difference θ = β − β′, ensures that any pair (β, β′) that is well-separated in ℓ2-norm remains well-separated in the ℓ2-prediction semi-norm. Interestingly, Assumption 2 is also essential in establishing lower bounds on the ℓ2-prediction error. Here the reason is somewhat different—namely, it ensures that the set Bq(Rq) is still suitably "large" when its diameter is measured in the ℓ2-prediction semi-norm. As we show, it is this size that governs the difficulty of estimation in the prediction semi-norm.

The condition (5) is almost equivalent to bounding the smallest singular value of X/√n restricted to the set Bq(2Rq). Indeed, the only difference is the "slack" provided by fℓ(Rq, n, d). The reader might question why this slack term is actually needed. In fact, it is essential in the case q ∈ (0, 1], since the set Bq(2Rq) spans all directions of the space Rd. (This is not true in the limiting case q = 0.) Since X must have a non-trivial null-space when d > n, the condition (5) can never be satisfied with fℓ(Rq, n, d) = 0 whenever d > n and q ∈ (0, 1].

Interestingly, for appropriate choices of the slack term fℓ(Rq, n, d), the restricted eigenvalue condition is satisfied with high probability for many random matrices, as shown by the following result.

Proposition 1. Consider a random matrix X ∈ Rn×d formed by drawing each row i.i.d. from a N(0, Σ) distribution with maximal variance ρ²(Σ) = max_{j=1,...,d} Σjj. If (ρ(Σ)/λmin(Σ^{1/2})) Rq (log d / n)^{1/2−q/4} < c1 for a sufficiently small universal constant c1 > 0, then

‖Xθ‖₂/√n ≥ (λmin(Σ^{1/2})/4) ‖θ‖₂ − 18 ρ(Σ) Rq (log d / n)^{1−q/2} (6)

for all θ ∈ Bq(2Rq), with probability at least 1 − c2 exp(−c3 n).

An immediate consequence of the bound (6) is that Assumption 2 holds with

fℓ(Rq, n, d) = c (ρ(Σ)/λmin(Σ^{1/2})) Rq (log d / n)^{1−q/2} (7)

for some universal constant c. We make use of this condition in Theorems 2(a) and 3(a) to follow. The proof of Proposition 1, provided in Appendix A, follows as a consequence of a random matrix result in Raskutti et al. [29]. In the same paper (pp. 2248–49) it is demonstrated that there are many interesting classes of non-identity covariance matrices, among them Toeplitz matrices, constant correlation matrices and spiked models, to which Proposition 1 can be applied.
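The conclusion of Proposition 1 is easy to probe numerically. The sketch below (with arbitrary, illustrative choices of n, d, q, Rq and a Toeplitz covariance—none of these values come from the paper) draws such a random design and checks the lower bound (6) on a few randomly generated vectors θ scaled onto the boundary of Bq(2Rq):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, q, R_q = 200, 500, 0.5, 2.0

# Toeplitz covariance Sigma_{jk} = 0.4^{|j-k|}; rows of X are i.i.d. N(0, Sigma).
Sigma = 0.4 ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

rho = np.sqrt(Sigma.diagonal().max())                      # rho(Sigma)
lam_min_sqrt = np.sqrt(np.linalg.eigvalsh(Sigma).min())    # lambda_min(Sigma^{1/2})
slack = 18 * rho * R_q * (np.log(d) / n) ** (1 - q / 2)    # slack term in (6)

for _ in range(5):
    theta = rng.standard_normal(d)
    theta *= (2 * R_q / np.sum(np.abs(theta) ** q)) ** (1 / q)  # place theta on the boundary of Bq(2Rq)
    lhs = np.linalg.norm(X @ theta) / np.sqrt(n)
    rhs = (lam_min_sqrt / 4) * np.linalg.norm(theta) - slack
    print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}, bound (6) holds: {lhs >= rhs}")
```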

For the special case q = 0, the following conditions are needed for upper and lower bounds on ℓ2-norm error, and lower bounds on ℓ2-prediction error.

Assumption 3 (Sparse Eigenvalue Conditions).
(a) There exists a constant κu < +∞ such that

(1/√n)‖Xθ‖₂ ≤ κu ‖θ‖₂ for all θ ∈ B0(2s). (8)

(b) There exists a constant κ0,ℓ > 0 such that

(1/√n)‖Xθ‖₂ ≥ κ0,ℓ ‖θ‖₂ for all θ ∈ B0(2s). (9)

Assumption 3 is the analogue of Assumption 2 adapted to the special case q = 0 of exactly sparse models; however, in this case, no slack term fℓ(Rq, n, d) is involved. As we discuss at more length in Section II-E, Assumption 3 is closely related to conditions imposed in analyses of ℓ1-based relaxations, such as the restricted isometry property [8] as well as related but less restrictive sparse eigenvalue conditions [4, 25, 34]. Unlike the restricted isometry property, Assumption 3 does not require that the constants κu and κ0,ℓ are close to one; indeed, they can be arbitrarily large (respectively small), as long as they are finite and non-zero. In this sense, it is most closely related to the sparse eigenvalue conditions introduced by Bickel et al. [4], and we discuss these connections at more length in Section II-E. The set B0(2s) is a union of 2s-dimensional subspaces, which does not span all directions of Rd. Since the condition may be satisfied for d > n, no slack term fℓ(Rq, n, d) is required in the case q = 0.

In addition, our lower bounds on ℓ2-error involve the set defined by intersecting the null space (or kernel) of X with the ℓq-ball, which we denote by Nq(X) := Ker(X) ∩ Bq(Rq). We define the Bq(Rq)-kernel diameter in the ℓ2-norm as

diam2(Nq(X)) := max_{θ ∈ Nq(X)} ‖θ‖₂ = max_{‖θ‖_q^q ≤ Rq, Xθ = 0} ‖θ‖₂. (10)

The significance of this diameter should be apparent: for any "perturbation" ∆ ∈ Nq(X), it follows immediately from the linear observation model (1) that no method could ever distinguish between β∗ = 0 and β∗ = ∆. Consequently, this Bq(Rq)-kernel diameter is a measure of the lack of identifiability of the linear model (1) over the set Bq(Rq).

It is useful to recognize that Assumptions 2 and 3 are closely related to the diameter condition (10); in particular, these assumptions imply an upper bound on the Bq(Rq)-kernel diameter in ℓ2-norm, and hence limit the lack of identifiability of the model.

Lemma 1 (Bounds on non-identifiability).
(a) Case q ∈ (0, 1]: If Assumption 2 holds, then the Bq(Rq)-kernel diameter is upper bounded as diam2(Nq(X)) = O(fℓ(Rq, n, d)).
(b) Case q = 0: If Assumption 3(b) is satisfied, then diam2(N0(X)) = 0. (In words, the only element of B0(2s) in the kernel of X is the 0-vector.)

These claims follow in a straightforward way from the definitions given in the assumptions. In Section II-E, we discuss further connections between our assumptions and the conditions imposed in analyses of the Lasso and other ℓ1-based methods [4, 8, 24, 26], for the case q = 0.

B. Universal constants and non-asymptotic statements

Having described our assumptions on the design matrix, we now turn to the main results that provide upper and lower bounds on minimax rates. Before doing so, let us clarify our use of universal constants in our statements. Our main goal is to track the dependence of minimax rates on the triple (n, d, Rq), as well as the noise variance σ² and the properties of the design matrix X. In our statement of the minimax rates themselves, we use c to denote a universal positive constant that is independent of (n, d, Rq), the noise variance σ² and the parameters of the design matrix X. In this way, our minimax rates explicitly track the dependence of all of these quantities in a non-asymptotic manner. In setting up the results, we also state certain conditions that involve a separate set of universal constants denoted c1, c2 etc.; these constants are independent of (n, d, Rq) but may depend on properties of the design matrix.

In this paper, our primary interest is the high-dimensional regime in which d ≫ n. Our theory is non-asymptotic, applying to all finite choices of the triple (n, d, Rq). Throughout the analysis, we impose the following conditions on this triple. In the case q = 0, we require that the sparsity index s = R0 satisfies d ≥ 4s ≥ c2. These bounds ensure that our probabilistic statements are all non-trivial (i.e., are violated with probability less than 1). For q ∈ (0, 1], we require that for some choice of universal constants c1, c2 > 0 and δ ∈ (0, 1), the triple (n, d, Rq) satisfies

d / (Rq n^{q/2}) ≥(i) c1 d^δ ≥(ii) c2. (11)

The condition (ii) ensures that the dimension d is sufficiently large so that our probabilistic guarantees are all non-trivial (i.e., hold with probability strictly less than 1). In the regime d > n that is of interest in this paper, the condition (i) on (n, d, Rq) is satisfied as long as the radius Rq does not grow too quickly in the dimension d. (As a concrete example, the bound Rq ≤ c3 d^{1/2−δ′} for some δ′ ∈ (0, 1/2) is one sufficient condition.)

C. Optimal minimax rates in ℓ2-norm loss

We are now ready to state minimax bounds, and we begin with lower bounds on the ℓ2-norm error:

Theorem 1 (Lower bounds on ℓ2-norm error). Consider the linear model (1) for a fixed design matrix X ∈ Rn×d.

(a) Case q ∈ (0, 1]: Suppose that X is column-normalized (Assumption 1 holds with κc < ∞), and Rq (log d / n)^{1−q/2} < c1 for a universal constant c1. Then

M2(Bq(Rq), X) ≥ c max{ diam2²(Nq(X)), Rq ((σ²/κc²)(log d / n))^{1−q/2} } (12)

with probability greater than 1/2.

(b) Case q = 0: Suppose that Assumption 3(a) holds with κu > 0, and s log(d/s)/n < c1 for a universal constant c1. Then

M2(B0(s), X) ≥ c max{ diam2²(N0(X)), (σ²/κu²) s log(d/s)/n } (13)

with probability greater than 1/2.

The choice of probability 1/2 is a standard convention for stating minimax lower bounds on rates.³ Note that both lower bounds consist of two terms. The first term corresponds to the diameter of the set Nq(X) = Ker(X) ∩ Bq(Rq), a quantity which reflects the extent to which the linear model (1) is unidentifiable. Clearly, one cannot estimate β∗ any more accurately than the diameter of this set. In both lower bounds, the ratios σ²/κc² (or σ²/κu²) correspond to the inverse of the signal-to-noise ratio, comparing the noise variance σ² to the magnitude of the design matrix measured by κc (respectively κu), since the constants cq and c0 do not depend on the design X. As the proof will clarify, the term [log d]^{1−q/2} in the lower bound (12), and similarly the term log(d/s) in the bound (13), are reflections of the complexity of the ℓq-ball, as measured by its metric entropy. For many classes of random Gaussian design matrices, the second term is of larger order than the diameter term, and hence determines the rate.

We now state upper bounds on the ℓ2-norm minimax rate over ℓq-balls. For these results, we require the column normalization condition (Assumption 1), and Assumptions 2 and 3. The upper bounds are proven by a careful analysis of constrained least-squares over the set Bq(Rq)—namely, the estimator

β̂ ∈ arg min_{β ∈ Bq(Rq)} ‖y − Xβ‖₂². (14)
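To make the estimator concrete, the q = 0 instance of (14)—least-squares over the ℓ0-ball B0(s) with s known—can be written as a brute-force search over supports. The sketch below is only an illustration of the definition (exhaustive enumeration is feasible only for small d) and is not a practical algorithm:

```python
import itertools
import numpy as np

def l0_constrained_least_squares(X, y, s):
    """Least-squares over B0(s): enumerate all supports of size s and keep the best fit."""
    n, d = X.shape
    best_beta, best_rss = np.zeros(d), np.inf
    for support in itertools.combinations(range(d), s):
        cols = list(support)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = np.sum((y - X[:, cols] @ coef) ** 2)
        if rss < best_rss:
            best_rss = rss
            best_beta = np.zeros(d)
            best_beta[cols] = coef
    return best_beta
```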

Theorem 2 (Upper bounds on ℓ2-norm loss). Consider the model (1) with a fixed design matrix X ∈ Rn×d that is column-normalized (Assumption 1 with κc < ∞).

³This probability may be made arbitrarily close to 1 by suitably modifying the constants in the statement.

(a) For q ∈ (0, 1]: Suppose that Rq (log d / n)^{1−q/2} < c1 and X satisfies Assumption 2 with κℓ > 0 and fℓ(Rq, n, d) ≤ c2 Rq (log d / n)^{1−q/2}. Then

M2(Bq(Rq), X) ≤ c Rq [ (κc²/κℓ²) (σ²/κℓ²) (log d / n) ]^{1−q/2} (15)

with probability greater than 1 − c3 exp(−c4 log d).

(b) For q = 0: Suppose that X satisfies Assumption 3(b) with κ0,ℓ > 0. Then

M2(B0(s), X) ≤ c (κc²/κ0,ℓ²) (σ²/κ0,ℓ²) s log d / n (16)

with probability greater than 1 − c1 exp(−c2 log d). If, in addition, the design matrix satisfies Assumption 3(a) with κu < ∞, then

M2(B0(s), X) ≤ c (κu²/κ0,ℓ²) (σ²/κ0,ℓ²) s log(d/s) / n, (17)

this bound holding with probability greater than 1 − c1 exp(−c2 s log(d/s)).

In the case of ℓ2-error and design matrices X that satisfy the assumptions of both Theorems 1 and 2, these results identify the minimax optimal rate up to constant factors. In particular, for q ∈ (0, 1], the minimax rate in ℓ2-norm scales as

M2(Bq(Rq), X) = Θ( Rq [σ² log d / n]^{1−q/2} ), (18)

whereas for q = 0, the minimax ℓ2-norm rate scales as

M2(B0(s), X) = Θ( σ² s log(d/s) / n ). (19)

D. Optimal minimax rates in ℓ2-prediction norm

In this section, we investigate minimax rates in terms of the ℓ2-prediction loss ‖X(β̂ − β∗)‖₂²/n, and provide both lower and upper bounds on it. The rates match the rates for ℓ2-norm loss, but the conditions on the design matrix X enter the upper and lower bounds in a different way, and we discuss these complementary roles in Section II-F.

Theorem 3 (Lower bounds on prediction error). Consider the model (1) with a fixed design matrix X ∈ Rn×d that is column-normalized (Assumption 1 with κc < ∞).

(a) For q ∈ (0, 1]: Suppose that Rq (log d / n)^{1−q/2} < c1, and the design matrix X satisfies Assumption 2 with κℓ > 0 and fℓ(Rq, n, d) < c2 Rq (log d / n)^{1−q/2}. Then

Mn(Bq(Rq), X) ≥ c Rq κℓ² [ (σ²/κc²) (log d / n) ]^{1−q/2} (20)

with probability at least 1/2.

(b) For q = 0: Suppose that X satisfies Assumption 3(b) with κ0,ℓ > 0 and Assumption 3(a) with κu < ∞, and that s log(d/s)/n < c1 for some universal constant c1. Then

Mn(B0(s), X) ≥ c κ0,ℓ² (σ²/κu²) s log(d/s) / n (21)

with probability at least 1/2.

In the other direction, we state upper bounds obtained via analysis of least-squares constrained to the ball Bq(Rq), the procedure previously defined in (14).

Theorem 4 (Upper bounds on prediction error). Consider the model (1) with a fixed design matrix X ∈ Rn×d.

(a) Case q ∈ (0, 1]: If X satisfies the column normalization condition, then with probability at least 1 − c1 exp(−c2 Rq (log d)^{1−q/2} n^{q/2}), we have

Mn(Bq(Rq), X) ≤ c κc² Rq [ (σ²/κc²) (log d / n) ]^{1−q/2}. (22)

(b) Case q = 0: For any X, with probability greater than 1 − c1 exp(−c2 s log(d/s)), we have

Mn(B0(s), X) ≤ c σ² s log(d/s) / n. (23)

We note that Theorem 4(b) was stated and proven in Bunea et al. [7] (see Theorem 3.1). However, we have included the statement here for completeness and so as to facilitate discussion.

E. Some remarks and comparisons

In order to provide the reader with some intuition, let us make some comments about the scalings that appear in our results. We comment on the conditions we impose on X in the next section.

• For the case q = 0, there is a concrete interpretation of the rate s log(d/s)/n, which appears in Theorems 1(b), 2(b), 3(b) and 4(b). Note that there are (d choose s) subsets of size s within {1, 2, . . . , d}, and by standard bounds on binomial coefficients [13], we have log (d choose s) = Θ(s log(d/s)). Consequently, the rate s log(d/s)/n corresponds to the log number of models divided by the sample size n (see the numerical check following this list). Note that in the regime where d/s ∼ d^γ for some γ > 0, this rate is equivalent (up to constant factors) to s log d / n.

• For q ∈ (0, 1], the interpretation of the rate Rq (log d / n)^{1−q/2}, which appears in parts (a) of Theorems 1 through 4, can be understood as follows. Suppose that we choose a subset of size sq of coefficients to estimate, and ignore the remaining d − sq coefficients. For instance, if we were to choose the top sq coefficients of β∗ in absolute value, then the fast decay imposed by the ℓq-ball condition on β∗ would mean that the remaining d − sq coefficients would have relatively little impact. With this intuition, the rate for q > 0 can be interpreted as the rate that would be achieved by choosing sq = Rq (log d / n)^{−q/2}, and then acting as if the problem were an instance of a hard-sparse problem (q = 0) with s = sq. For such a problem, we would expect to achieve the rate sq log d / n, which is exactly equal to Rq (log d / n)^{1−q/2}. Of course, we have only made a very heuristic argument here; we make this truncation idea and the optimality of the particular choice sq precise in Lemma 5 to follow in the sequel.

• It is also worthwhile considering the form of our results in the special case of the Gaussian sequence model, for which X = √n In×n and d = n. With these special settings, our results yield the same scaling (up to constant factors) as seminal work by Donoho and Johnstone [14], who determined minimax rates for ℓp-losses over ℓq-balls. Our work applies to the case of general X, in which the sample size n need not be equal to the dimension d; however, we re-capture the same scaling Rq (log n / n)^{1−q/2} as Donoho and Johnstone [14] when specialized to the case X = √n In×n and ℓp = ℓ2. Other work by van de Geer and Loubes [35] derives bounds on prediction error for general thresholding estimators, again in the case d = n, and our results agree in this particular case as well.

• As noted in the introduction, during the process of writing up our results, we became aware of concurrent work by Zhang [42] on the problem of determining minimax upper and lower bounds for ℓp-losses with ℓq-sparsity for q > 0 and p ≥ 1. There are notable differences between our results and Zhang's. First, we treat the ℓ2-prediction loss not covered by Zhang, and also show how assumptions on the design X enter in complementary ways for ℓ2-loss versus prediction loss. We also have results for the important case of hard sparsity (q = 0), not treated in Zhang's paper. On the other hand, Zhang provides tight bounds for general ℓp-losses (p ≥ 1), not covered in this paper. It is also worth noting that the underlying proof techniques for the lower bounds are very different. We use a direct information-theoretic approach based on Fano's method and the metric entropy of ℓq-balls. In contrast, Zhang makes use of an extension of the Bayesian least favorable prior approach used by Donoho and Johnstone [14]. Theorems 1 and 2 from his paper [42] (in the case p = 2) are similar to Theorems 1(a) and 2(a) in our paper, but the conditions on the design matrix X imposed by Zhang are different from the ones imposed here. Furthermore, the conditions in Zhang are not directly comparable, so it is difficult to say whether our conditions are stronger or weaker than his.

• Finally, in the special cases q = 0 and q = 1, subsequent work by Rigollet and Tsybakov [30] has yielded sharper results on the prediction error (compare our Theorems 3 and 4 to equations (5.24) and (5.25) in their paper). They explicitly take effects of the rank of X into account, yielding tighter rates in the case rank(X) ≪ n. In contrast, our results are based on the assumption rank(X) = n, which holds in many cases of interest.
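The binomial-coefficient approximation invoked in the first bullet above can be verified numerically; this is just a quick sanity check with arbitrary (d, s) pairs, not part of the paper's argument:

```python
import numpy as np
from scipy.special import gammaln

def log_binom(d, s):
    """log of the binomial coefficient C(d, s), computed stably via log-gamma."""
    return gammaln(d + 1) - gammaln(s + 1) - gammaln(d - s + 1)

for d, s in [(1000, 10), (10000, 50), (100000, 100)]:
    # The two quantities agree up to constant factors: log C(d, s) = Theta(s log(d/s)).
    print(d, s, round(log_binom(d, s), 1), round(s * np.log(d / s), 1))
```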

F. Role of conditions on X

In this subsection, we discuss the conditions on the design matrix X involved in our analysis, and the different roles that they play in upper/lower bounds and different losses.

1) Upper and lower bounds require complementary conditions: It is worth noting that the minimax rates for ℓ2-prediction error and ℓ2-norm error are essentially the same, except that the design matrix structure enters the minimax rates in very different ways. In particular, note that proving lower bounds on prediction error for q > 0 requires imposing relatively strong conditions on the design X—namely, Assumptions 1 and 2 as stated in Theorem 3. In contrast, obtaining upper bounds on prediction error requires very mild conditions. At the most extreme, the upper bound for q = 0 in Theorem 4 requires no assumptions on X, while for q > 0 only the column normalization condition is required. All of these statements are reversed for ℓ2-norm losses, where lower bounds for q > 0 can be proved with only Assumption 1 on X (see Theorem 1), whereas upper bounds require both Assumptions 1 and 2.

In order to appreciate the difference between the conditions for ℓ2-prediction error and ℓ2-error, it is useful to consider a toy but illuminating example. Consider the linear regression problem defined by a design matrix X = [X1 X2 · · · Xd] with identical columns—that is, Xj = X1 for all j = 1, . . . , d. We assume that the vector X1 ∈ Rn is suitably scaled so that the column-normalization condition (Assumption 1) is satisfied. For this particular choice of design matrix, the linear observation model (1) reduces to y = (∑_{j=1}^{d} β∗j) X1 + w. For the case of hard sparsity (q = 0), an elementary argument shows that the minimax rate in ℓ2-prediction error scales as Θ(1/n). This scaling implies that the upper bound (23) from Theorem 4 holds (but is not tight). It is trivial to prove the correct upper bounds for prediction error using an alternative approach.⁴ Consequently, this highly degenerate design matrix yields a very easy problem for ℓ2-prediction, since the 1/n rate is essentially low-dimensional parametric. In sharp contrast, for the case of ℓ2-norm error (still with hard sparsity q = 0), the model becomes unidentifiable. To see the lack of identifiability, let ei ∈ Rd denote the unit vector with 1 in position i, and consider the two regression vectors β∗ = c e1 and β̃ = c e2, for some constant c ∈ R. Both choices yield the same observation vector y, and since the choice of c is arbitrary, the minimax ℓ2-error is infinite. In this case, the lower bound (13) on ℓ2-error from Theorem 1 holds (and is tight, since the kernel diameter is infinite). In contrast, the upper bound (16) on ℓ2-error from Theorem 2(b) does not apply, because Assumption 3(b) is violated due to the extreme degeneracy of the design matrix.
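The following short numpy sketch (with arbitrary illustrative sizes) reproduces the identifiability failure for this degenerate design: the noiseless observations cannot distinguish β∗ = c e1 from β̃ = c e2, even though the two vectors are far apart in ℓ2-norm:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, c = 50, 10, 3.0

x1 = rng.standard_normal(n)
x1 *= np.sqrt(n) / np.linalg.norm(x1)          # enforce column normalization (Assumption 1)
X = np.tile(x1[:, None], (1, d))               # X_j = X_1 for every column j

beta_star = c * np.eye(d)[0]                   # c * e_1
beta_tilde = c * np.eye(d)[1]                  # c * e_2

# Identical noiseless observations, so no estimator can tell the two apart...
print(np.allclose(X @ beta_star, X @ beta_tilde))   # True
# ...even though they are far apart in ell_2-norm.
print(np.linalg.norm(beta_star - beta_tilde))        # sqrt(2) * |c|
```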

2) Comparison to conditions required for ℓ1-based methods: Naturally, our work also has some connections to the vast body of work on ℓ1-based methods for sparse estimation, particularly for the case of hard sparsity (q = 0). Based on our results, the rates that are achieved by ℓ1-methods, such as the Lasso and the Dantzig selector, are minimax optimal up to constant factors for ℓ2-norm loss and ℓ2-prediction loss. However, the bounds on ℓ2-error and ℓ2-prediction error for the Lasso and Dantzig selector require different conditions on the design matrix. We compare the conditions that we impose in our minimax analysis in Theorem 2(b) to various conditions imposed in the analysis of ℓ1-based methods, including the restricted isometry property of Candes and Tao [8], the restricted eigenvalue condition imposed in Meinshausen and Yu [25], the partial Riesz condition in Zhang and Huang [41] and the restricted eigenvalue condition of Bickel et al. [4]. We find that in the case where s is known, "optimal" methods, which are based on minimizing least-squares directly over the ℓ0-ball, can succeed for design matrices where ℓ1-based methods are not known to work for q = 0, as we discuss at more length in Section II-F2 to follow. As noted by a reviewer, unlike the direct methods that we have considered, ℓ1-based methods typically do not assume any prior knowledge of the sparsity index, but they do require knowledge or estimation of the noise variance.

⁴Note that the lower bound (21) on the ℓ2-prediction error from Theorem 3 does not apply to this model, since this degenerate design matrix with identical columns does not satisfy Assumption 3(b).

One set of conditions, known as the restricted isometry property [8] or RIP for short, is based on very strong constraints on the condition numbers of all sub-matrices of X up to size 2s, requiring that they be near-isometries (i.e., with condition numbers close to 1). Such conditions are satisfied by matrices with columns that are all very close to orthogonal (e.g., when X has i.i.d. N(0, 1) entries and n = Ω(log (d choose 2s))), but are violated for many reasonable matrix classes (e.g., Toeplitz matrices) that arise in statistical practice. Zhang and Huang [41] imposed a weaker sparse Riesz condition, based on imposing constraints (different from those of RIP) on the condition numbers of all submatrices of X up to a size that grows as a function of s and n. Meinshausen and Yu [25] impose a bound in terms of the condition numbers or minimum and maximum restricted eigenvalues for submatrices of X up to size s log n. It is unclear whether the conditions in Meinshausen and Yu [25] are weaker or stronger than the conditions in Zhang and Huang [41]. Bickel et al. [4] show that their restricted eigenvalue condition is less severe than both the RIP condition [8] and an earlier set of restricted eigenvalue conditions due to Meinshausen and Yu [25].

Here we state a restricted eigenvalue condition that is very closely related to the condition imposed in Bickel et al. [4] and, as shown by Negahban et al. [26], is sufficient for bounding the ℓ2-error of the Lasso. In particular, for a given subset S ⊂ {1, . . . , d} and constant α ≥ 1, let us define the set

C(S; α) := {θ ∈ Rd | ‖θ_{S^c}‖₁ ≤ α ‖θ_S‖₁ + 4‖β∗_{S^c}‖₁}, (24)

where β∗ is the true parameter. Note that for q = 0 the term ‖β∗_{S^c}‖₁ = 0, in which case the set is very closely related to the one appearing in the restricted eigenvalue condition of Bickel et al. [4], while for q ∈ (0, 1] this term is non-zero. With this notation, the restricted eigenvalue condition in Negahban et al. [26] can be stated as follows: there exists a constant κ > 0 such that

(1/√n)‖Xθ‖₂ ≥ κ‖θ‖₂ for all θ ∈ C(S; 3).

Negahban et al. [26] show that under this restricted eigenvalue condition, the Lasso estimator has squared ℓ2-error upper bounded by O(Rq (log d / n)^{1−q/2}). (To be clear, Negahban et al. [26] study a more general class of M-estimators, and impose a condition known as restricted strong convexity; however, it reduces to an RE condition in this special case.) For the case q ∈ (0, 1], the analogous restricted lower eigenvalue condition we impose is Assumption 2. Recall that this states that for q ∈ (0, 1], the eigenvalues restricted to the set

{θ ∈ Rd | θ ∈ Bq(2Rq) and ‖θ‖₂ ≥ fℓ(Rq, n, d)}

remain bounded away from zero. Both conditions impose lower bounds on the restricted eigenvalues over sets of weakly sparse vectors.

3) Comparison with restricted eigenvalue condition: It is interesting to compare the restricted eigenvalue condition in Bickel et al. [4] with the condition underlying Theorem 2, namely Assumption 3(b). In the case q = 0, the condition required by the estimator that performs least-squares over the ℓ0-ball—namely, the form of Assumption 3(b) used in Theorem 2(b)—is not stronger than the restricted eigenvalue condition in Bickel et al. [4]. This fact was previously established by Bickel et al. (see p. 7, [4]). We now provide a simple pedagogical example to show that the ℓ1-based relaxation can fail to recover the true parameter while the optimal ℓ0-based algorithm succeeds. In particular, let us assume that the noise vector w = 0, and consider the design matrix

X = [ 1  −2  −1
      2  −3  −3 ],

corresponding to a regression problem with n = 2 and d = 3. Say that the regression vector β∗ ∈ R3 is hard sparse with one non-zero entry (i.e., s = 1). Observe that the vector ∆ := [1  1/3  1/3] belongs to the null-space of X, and moreover ∆ ∈ C(S; 3) but ∆ ∉ B0(2). Since all the 2×2 sub-matrices of X have rank two, we have B0(2) ∩ ker(X) = {0}, so that by known results from Cohen et al. [12] (see, in particular, their Lemma 3.1), the condition B0(2) ∩ ker(X) = {0} implies that (in the noiseless setting w = 0) the ℓ0-based algorithm can exactly recover any 1-sparse vector. On the other hand, suppose that, for instance, the true regression vector is given by β∗ = [1  0  0]. If the Lasso were applied to this problem with no noise, it would incorrectly recover the solution β̃ := [0  −1/3  −1/3], since ‖β̃‖₁ = 2/3 < 1 = ‖β∗‖₁.

Although this example is low-dimensional with (s, d) = (1, 3), higher-dimensional examples of design matrices that satisfy the conditions required for the minimax rate, but not the conditions required for ℓ1-based methods, may be constructed using similar arguments. This construction highlights that there are instances of design matrices X for which ℓ1-based methods fail to recover the true parameter β∗ for q = 0 while the optimal ℓ0-based algorithm succeeds.
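The arithmetic behind this example is easy to verify; the following numpy sketch checks that ∆ lies in the null space of X and that the competing solution has strictly smaller ℓ1-norm than β∗, which is exactly why a noiseless ℓ1-based method (basis pursuit, the limit of the Lasso) prefers the wrong vector:

```python
import numpy as np

X = np.array([[1.0, -2.0, -1.0],
              [2.0, -3.0, -3.0]])
beta_star = np.array([1.0, 0.0, 0.0])
delta = np.array([1.0, 1/3, 1/3])
beta_alt = beta_star - delta                    # equals [0, -1/3, -1/3]

print(np.allclose(X @ delta, 0))                             # True: Delta is in ker(X)
print(np.allclose(X @ beta_star, X @ beta_alt))              # True: identical noiseless observations
print(np.abs(beta_alt).sum(), "<", np.abs(beta_star).sum())  # 2/3 < 1: ell_1 prefers beta_alt
```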


In summary, for the hard sparsity case q = 0, methods based on ℓ1-relaxation can achieve the minimax optimal rate O(s log d / n) for ℓ2-error. However, the current analyses of these ℓ1-methods [4, 8, 25, 34] are based on imposing stronger conditions on the design matrix X than those required by the estimator that performs least-squares over the ℓ0-ball with s known.

III. PROOFS OF MAIN RESULTS

In this section, we provide the proofs of our main theorems, with more technical lemmas and their proofs deferred to the appendices. To begin, we provide a high-level overview that outlines the main steps of the proofs.

A. Basic steps for lower bounds

The proofs of the lower bounds follow an information-theoretic method based on Fano's inequality [13], as used in classical work on nonparametric estimation [20, 39, 40]. A key ingredient is a sharp characterization of the metric entropy structure of ℓq-balls [10, 21]. At a high level, the proof of each lower bound follows three basic steps. The first two steps are general and apply to all the lower bounds in this paper, while the third is different in each case:

(1) In order to lower bound the minimax risk in some norm ‖·‖∗, we let M(δn, Bq(Rq)) be the cardinality of a maximal packing of the ball Bq(Rq) in the norm ‖·‖∗, say with elements {β^1, . . . , β^M}. A precise definition of a packing set is provided in the next section. A standard argument (e.g., [19, 39, 40]) yields a lower bound on the minimax rate in terms of the error in a multi-way hypothesis testing problem: in particular, the probability P[min_{β̂} max_{β∗ ∈ Bq(Rq)} ‖β̂ − β∗‖∗² ≥ δn²/4] is at least min_{β̃} P[β̃ ≠ B], where the random vector B ∈ Rd is uniformly distributed over the packing set {β^1, . . . , β^M}, and the estimator β̃ takes values in the packing set.

(2) The next step is to derive a lower bound on P[B ≠ β̃]; in this paper, we make use of Fano's inequality [13]. Since B is uniformly distributed over the packing set, we have

P[B ≠ β̃] ≥ 1 − (I(y; B) + log 2) / log M(δn, Bq(Rq)),

where I(y; B) is the mutual information between the random parameter B in the packing set and the observation vector y ∈ Rn. (Recall that for two random variables X and Y, the mutual information is given by I(X, Y) = E_Y[D(P_{X|Y} ‖ P_X)].) The distribution P_{Y|B} is the conditional distribution of Y given B, where B is uniformly distributed over the packing set and Y follows the Gaussian distribution induced by the model (1).

(3) The final and most challenging step involves upper bounding I(y; B) so that P[β̃ ≠ B] ≥ 1/2. For each lower bound, the approach to upper bounding I(y; B) is slightly different. Our proof for q = 0 is based on the generalized Fano method [18], whereas for the case q ∈ (0, 1], we upper bound I(y; B) by a more intricate technique introduced by Yang and Barron [39]. We derive an upper bound on the ǫn-covering number of Bq(Rq) with respect to the ℓ2-prediction semi-norm. Using Lemma 3 in Section III-C2 and the column normalization condition (Assumption 1), we establish a link between covering numbers in the ℓ2-prediction semi-norm and covering numbers in the ℓ2-norm. Finally, we choose the free parameters δn > 0 and ǫn > 0 so as to optimize the lower bound.

B. Basic steps for upper bounds

The proofs of the upper bounds involve direct analysis of the natural estimator that performs least-squares over the ℓq-ball:

β̂ ∈ arg min_{‖β‖_q^q ≤ Rq} ‖y − Xβ‖₂².

The proof is constructive and involves two steps, the first of which is standard while the second step is more specific to each problem:

(1) Since ‖β∗‖_q^q ≤ Rq by assumption, β∗ is feasible for the least-squares problem, meaning that we have ‖y − Xβ̂‖₂² ≤ ‖y − Xβ∗‖₂². Defining the error vector ∆̂ = β̂ − β∗ and performing some algebra, we obtain the inequality

(1/n)‖X∆̂‖₂² ≤ 2|wᵀX∆̂|/n.

(2) The second and more challenging step involves computing upper bounds on the supremum of the Gaussian process over Bq(2Rq), which allows us to upper bound |wᵀX∆̂|/n. For each of the upper bounds, our approach is slightly different in the details. Common steps include upper bounds on the covering numbers of the ball Bq(2Rq), as well as on the image of these balls under the mapping X : Rd → Rn. We also make use of some chaining and peeling results from empirical process theory (e.g., van de Geer [33]). For upper bounds on the ℓ2-norm error (Theorem 2), Assumption 2 for q > 0 and Assumption 3(b) for q = 0 are used to upper bound ‖∆̂‖₂² in terms of (1/n)‖X∆̂‖₂².

C. Packing, covering, and metric entropy

The notions of packing and covering numbers play a crucial role in our analysis, so we begin with some background, with emphasis on the case of covering/packing of ℓq-balls in the ℓ2 metric.

Definition 1 (Covering and packing numbers). Consider a compact metric space consisting of a set S and a metric ρ : S × S → R+.

(a) An ǫ-covering of S in the metric ρ is a collection {β^1, . . . , β^N} ⊂ S such that for all β ∈ S, there exists some i ∈ {1, . . . , N} with ρ(β, β^i) ≤ ǫ. The ǫ-covering number N(ǫ; S, ρ) is the cardinality of the smallest ǫ-covering.

(b) A δ-packing of S in the metric ρ is a collection {β^1, . . . , β^M} ⊂ S such that ρ(β^i, β^j) > δ for all i ≠ j. The δ-packing number M(δ; S, ρ) is the cardinality of the largest δ-packing.

It is worth noting that the covering and packing numbers are (up to constant factors) essentially the same. In particular, the inequalities

N(ǫ; S, ρ) ≤ M(ǫ; S, ρ) ≤ N(ǫ/2; S, ρ)

are standard (e.g., [27]). Consequently, given upper and lower bounds on the covering number, we can immediately infer similar upper and lower bounds on the packing number. Of interest in our results is the logarithm of the covering number log N(ǫ; S, ρ), a quantity known as the metric entropy.

A related set of quantities, frequently used in the operator theory literature [10, 21, 31], are the (dyadic) entropy numbers ǫk(S; ρ), defined as follows for k = 1, 2, . . .:

ǫk(S; ρ) := inf{ǫ > 0 | N(ǫ; S, ρ) ≤ 2^{k−1}}. (25)

By definition, note that we have ǫk(S; ρ) ≤ δ if and only if log₂ N(δ; S, ρ) ≤ k. For the remainder of this paper, the only metric used will be ρ = ℓ2, so to simplify notation, we denote the ℓ2-packing and covering numbers by M(ǫ; S) and N(ǫ; S).

1) Metric entropies of ℓq-balls: Central to our proofs is the metric entropy of the ball Bq(Rq) when the metric ρ is the ℓ2-norm, a quantity which we denote by log N(ǫ; Bq(Rq)). The following result, which provides upper and lower bounds on this metric entropy that are tight up to constant factors, is an adaptation of results from the operator theory literature [17, 21]; see Appendix B for the details. All bounds stated here apply to a dimension d ≥ 2.

Lemma 2. For q ∈ (0, 1] there is a constant Uq, depending only on q, such that for all ǫ ∈ [Uq Rq^{1/q} (log d / d)^{(2−q)/(2q)}, Rq^{1/q}], we have

log N(ǫ; Bq(Rq)) ≤ Uq ( Rq^{2/(2−q)} (1/ǫ)^{2q/(2−q)} log d ). (26)

Conversely, suppose in addition that ǫ < 1 and ǫ² = Ω( (Rq^{2/(2−q)} log d / d^ν)^{1−q/2} ) for some fixed ν ∈ (0, 1), depending only on q. Then there is a constant Lq ≤ Uq, depending only on q, such that

log N(ǫ; Bq(Rq)) ≥ Lq ( Rq^{2/(2−q)} (1/ǫ)^{2q/(2−q)} log d ). (27)

Remark: In our application of the lower bound (27), our typical choice of ǫ² will be of the order O((log d / n)^{1−q/2}). It can be verified that under the condition (11) from Section II-B, we are guaranteed that ǫ lies in the range required for the upper and lower bounds (26) and (27) to be valid.

2) Metric entropy of q-convex hulls: The proofs of the lower bounds all involve the Kullback-Leibler (KL) divergence between the distributions induced by different parameters β and β′ in Bq(Rq). Here we show that for the linear observation model (1), these KL divergences can be controlled in terms of the q-convex hull of the columns of the design matrix, and we provide some bounds on the associated metric entropy.

For two distributions P and Q that have densities dP and dQ with respect to some base measure µ, the Kullback-Leibler (KL) divergence is given by D(P ‖ Q) = ∫ log(dP/dQ) dP. We use Pβ to denote the distribution of y ∈ Rn under the linear regression model—in particular, it corresponds to the distribution of a N(Xβ, σ²In×n) random vector. A straightforward computation then leads to

D(Pβ ‖ Pβ′) = (1/(2σ²)) ‖Xβ − Xβ′‖₂².
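To make this computation explicit (a standard Gaussian identity, included here only for completeness), write y = Xβ + w with w ∼ N(0, σ²In×n) under Pβ; then

D(Pβ ‖ Pβ′) = E_{y∼Pβ}[ (1/(2σ²)) ( ‖y − Xβ′‖₂² − ‖y − Xβ‖₂² ) ]
            = (1/(2σ²)) E_w[ ‖X(β − β′) + w‖₂² − ‖w‖₂² ]
            = (1/(2σ²)) ( ‖X(β − β′)‖₂² + 2 E_w⟨X(β − β′), w⟩ )
            = (1/(2σ²)) ‖Xβ − Xβ′‖₂²,

since the cross term has zero mean.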

Note that the KL-divergence is proportional to the squared prediction semi-norm, so control of KL-divergences is equivalent, up to constants, to control of the prediction semi-norm. Control of KL-divergences requires understanding the metric entropy of the q-convex hull of the rescaled columns of the design matrix X. In particular, we define the set

absconvq(X/√n) := { (1/√n) ∑_{j=1}^{d} θj Xj | θ ∈ Bq(Rq) }. (28)

We have introduced the normalization by 1/√n for later technical convenience.


Under the column normalization condition, it turns out that the metric entropy of this set with respect to the ℓ2-norm is essentially no larger than the metric entropy of Bq(Rq), as summarized in the following lemma.

Lemma 3. Suppose that X satisfies the column normalization condition (Assumption 1 with constant κc) and ǫ ∈ [Uq Rq^{1/q} (log d / d)^{(2−q)/(2q)}, Rq^{1/q}]. Then there is a constant U′q depending only on q ∈ (0, 1] such that

log N(ǫ, absconvq(X/√n)) ≤ U′q [ Rq^{2/(2−q)} (κc/ǫ)^{2q/(2−q)} log d ].

The proof of this claim is provided in Appendix C. Note that apart from a different constant, this upper bound on the metric entropy is identical to that for log N(ǫ; Bq(Rq)) from Lemma 2.

D. Proof of lower bounds

We begin by proving our main results that providelower bounds on minimax rates, namely Theorems1and3.

1) Proof of Theorem1: Recall that forℓ2-norm error,the lower bounds in Theorem1 are the maximum oftwo expressions, one corresponding to the diameter ofthe setNq(X) intersected with theℓq-ball, and the othercorrespond to the metric entropy of theℓq-ball.

We begin by deriving the lower bound based on thediameter ofNq(X) = Bq(Rq) ∩ ker(X). The minimaxrate is lower bounded as

minβ

maxβ∈Bq(Rq)

‖β − β‖22 ≥ minβ

maxβ∈Nq(X)

‖β − β‖22,

where the inequality follows from the inclusionNq(X) ⊆ Bq(Rq). For any β ∈ Nq(X), wehave y = Xβ + w = w, so that y containsno information aboutβ ∈ Nq(X). Consequently,once β is chosen, there always exists an elementβ ∈ Nq(X) such that‖β − β‖2 ≥ 1

2 diam2(Nq(X)). In-deed, if ‖β‖2 ≥ 1

2 diam2(Nq(X)), then the adversarychoosesβ = 0 ∈ Nq(X). On the other hand, if‖β‖2 ≤ 1

2 diam2(Nq(X)), then there existsβ ∈ Nq(X)such that‖β‖2 = diam2(Nq(X)). By triangle inequality,we then have

‖β − β‖2 ≥ ‖β‖2 − ‖β‖2 ≥ 1

2diam2(Nq(X)).

Overall, we conclude that

minβ

maxβ∈Bq(Rq)

‖β − β‖22 ≥12diam2(Nq(X))

2.

In the following subsections, we follow steps (1)–(3)outlined earlier so as to obtain the second term in ourlower bounds on theℓ2-norm error and theℓ2-prediction

error. As has already been mentioned, steps (1) and (2)are general, whereas step (3) is different in each case.

a) Proof of Theorem1(a): Let M(δn,Bq(Rq)) bethe cardinality of a maximal packing of the ballBq(Rq)in theℓ2 metric, say with elementsβ1, . . . , βM. Then,by the standard arguments referred to earlier in step (1),we have

P[minβ

maxβ∈Bq(Rq)

‖β − β‖22 ≥ δ2n/4]≥ min

βP[β 6= B],

where the random vectorB ∈ Rd is uniformly dis-tributed over the packing setβ1, . . . , βM, and theestimatorβ takes values in the packing set. ApplyingFano’s inequality (step (2)) yields the lower bound

P[B 6= β] ≥ 1− I(y;B) + log 2

logM(δn,Bq(Rq)), (29)

where I(y;B) is the mutual information between ran-dom parameterB in the packing set and the observationvectory ∈ Rn.

It remains to upper bound the mutual information(step (3)); we do so using a procedure due to Yang andBarron [39]. It is based on covering the model spacePβ , β ∈ Bq(Rq) under the square-root Kullback-Leibler divergence. As noted prior to Lemma3, forthe Gaussian models given here, this square-root KLdivergence takes the form

√D(Pβ ‖Pβ′) =

1√2σ2

‖X(β − β′)‖2.

Let N(ǫn;Bq(Rq)) be the minimal cardinality of anǫn-covering of Bq(Rq) in ℓ2-norm. Using the up-per bound on the metric entropy ofabsconvq(X)provided by Lemma 3, we conclude that thereexists a set Xβ1, . . . , XβN such that for allXβ ∈ absconvq(X), there exists some indexi such that‖X(β − βi)‖2/

√n ≤ c κc ǫn for somec > 0. Following

the argument of Yang and Barron [39], we obtain thatthe mutual information is upper bounded as

I(y;B) ≤ logN(ǫn;Bq(Rq)) +c2 n

σ2κ2cǫ

2n.

Combining this upper bound with the Fano lowerbound (29) yields

P[B 6= β] ≥ 1− logN(ǫn;Bq(Rq)) +c2 nσ2 κ2

c ǫ2n + log 2

logM(δn;Bq(Rq)).

(30)

The final step is to choose the packing and covering radii(δn andǫn respectively) such that the lower bound (30)

13

Page 13: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

is greater than1/2. In order to do so, suppose that wechoose the pair(ǫn, δn) such that

c2 n

σ2κ2c ǫ

2n ≤ logN(ǫn,Bq(Rq)), (31a)

logM(δn,Bq(Rq)) ≥ 4 logN(ǫn,Bq(Rq)). (31b)

As long asN(ǫn,Bq(Rq)) ≥ 2, we are then guaranteedthat

P[B 6= β] ≥ 1− logN(ǫn,Bq(Rq)) + log 2

4 logN(ǫn,Bq(Rq))≥ 1/2,

as desired.It remains to determine choices ofǫn and δn that

satisfy the relations (31). From Lemma2, relation (31a)is satisfied by choosingǫn such that c2 n

2σ2 κ2c ǫ

2n =

Lq

[Rq

22−q

(1ǫn

) 2q2−q log d

], or equivalently such that

(ǫn) 4

2−q = Θ(Rq

22−q

σ2

κ2c

log d

n

).

In order to satisfy the bound (31b), it suffices tochooseδn such that

Uq

[Rq

22−q

( 1

δn

) 2q2−q log d

]≥ 4Lq

[Rq

22−q

( 1

ǫn

) 2q2−q log d

],

or equivalently such that

δ2n ≤[ Uq

4Lq

] 2−qq

(ǫn) 4

2−q

2−q2

=[ Uq

4Lq

] 2−qq L

2−q2

q Rq

[σ2

κ2c

log d

n

] 2−q2

Substituting into equation (12), we obtain

P

[M2(Bq(Rq), X) ≥ cq Rq

(σ2

κ2c

log d

n

)1− q2

]≥ 1

2,

for some absolute constantcq. This completes the proofof Theorem1(a).

b) Proof of Theorem1(b): In order to prove The-orem1(b), we require some definitions and an auxiliarylemma. For any integers ∈ 1, . . . , d, we define theset

H(s) : =z ∈ −1, 0,+1d | ‖z‖0 = s

.

Although the setH depends ons, we frequently dropthis dependence so as to simplify notation. We definethe Hamming distanceρH(z, z′) =

∑dj=1 I[zj 6= z′j ] be-

tween the vectorsz andz′. Next we require the followingresult:

Lemma 4. For d, s even ands < 2d/3, there exists asubsetH ⊂ H with cardinality |H| ≥ exp( s2 log

d−ss/2 )

such thatρH(z, z′) ≥ s2 for all z, z′ ∈ H.

Note that ifd and/ors is odd, we can embedH into ad− 1 and/ors− 1-dimensional hypercube and the resultholds. Although results of this type are known (e.g., seeLemma 4, [6]), for completeness, we provide a proofof Lemma 4 in Appendix D. Now consider a rescaled

version of the setH, say√

2sδnH for someδn > 0 to

be chosen. For any elementsβ, β′ ∈√

2sδnH, we have

2

sδ2n × ρH(β, β′) ≤ ‖β − β′‖22 ≤ 8

sδ2n × ρH(β, β′).

Therefore by applying Lemma4 and noting thatρH(β, β′) ≤ s for all β, β′ ∈ H, we have the follow-ing bounds on theℓ2-norm of their difference for all

elementsβ, β′ ∈√

2sδnH:

‖β − β′‖22 ≥ δ2n, and (32a)

‖β − β′‖22 ≤ 8δ2n. (32b)

Consequently, the rescaled set√

2sδnH is anδn-packing

set of B0(s) in ℓ2 norm with M(δn,B0(s)) = |H|elements, sayβ1, . . . , βM. Using this packing set, wenow follow the same classical steps as in the proof ofTheorem1(a), up until the Fano lower bound (29) (steps(1) and (2)).

At this point, we use an alternative upper boundon the mutual information (step (3)), namely thebound I(y;B) ≤ 1

(M2 )

∑i6=j D(βi ‖βj), which fol-

lows from the convexity of mutual information [13].For the linear observation model (1), we haveD(βi ‖βj) = 1

2σ2 ‖X(βi − βj)‖22. Since (β − β′) ∈B0(2s) by construction, from the assumptions onX andthe upper bound bound (32b), we conclude that

I(y;B) ≤ 8nκ2u δ

2n

2σ2.

Substituting this upper bound into the Fano lowerbound (29), we obtain

P[B 6= β] ≥ 1−8nκ2

u

2σ2 δ2n + log(2)s2 log

d−ss/2

.

Setting δ2n = 116

σ2

κ2u

s2n log d−s

s/2 ensures that this proba-bility is at least1/2. Consequently, combined with thelower bound (12), we conclude that

P

[M2(B0(s), X) ≥ 1

16

(σ2

κ2u

s

2nlog

d− s

s/2

)]≥ 1/2.

As long as d/s ≥ 3/2, we are guaranteed thatlog(d/s− 1) ≥ c log(d/s) for some constantc > 0,from which the result follows.

14

Page 14: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

2) Proof of Theorem3: We use arguments similarto the proof of Theorem1 in order to establish lowerbounds on prediction error‖X(β − β∗)‖2/

√n.

a) Proof of Theorem3(a): For some universalconstantc > 0 to be chosen, define

δ2n : = c Rq

(σ2

κ2c

log d

n

)1−q/2, (33)

and let β1, . . . , βM be an δn packing of theball Bq(Rq) in the ℓ2 metric, say with a total ofM(δn;Bq(Rq)) elements. We first show that ifn issufficiently large, then this set is also aκℓδn-packingset in the prediction (semi)-norm. From the theo-rem assumptions, we may choose universal constantsc1, c2 such thatfℓ(Rq, n, d) ≤ c2Rq

(log dn

)1−q/2and

Rq

(log dn

)1−q/2< c1. From Assumption2, for each

i 6= j, we are guaranteed that

‖X(βi − βj)‖2√n

≥ κℓ‖βi − βj‖2, (34)

as long as‖βi −βj‖2 ≥ fℓ(Rq, n, d). Consequently, forany fixedc > 0, we are guaranteed that

‖βi − βj‖2(i)

≥ δn(ii)

≥ c2Rq

( log dn

)1−q/2.

where inequality (i) follows sinceβjMj=1 is aδn-packing set. Here step (ii) follows because the theo-rem conditions imply that

Rq

( log dn

)1−q/2 ≤ √c1

[Rq

( log dn

)1−q/2]1/2

,

and we may choosec1 as as small as we please. (Notethat all of these statements hold for an arbitrarily smallchoice of c > 0, which we will choose later in theargument.) Sincefℓ(Rq, n, d) ≤ c2Rq

(log dn

)1−q/2

by assumption, the lower bound (34) guarantees thatβ1, β2, . . . , βM form a κℓδn-packing set in the pre-diction (semi)-norm‖X(βi − βj)‖2.

Given this packing set, we now follow a standardapproach, as in the proof of Theorem1(a), to reducethe problem of lower bounding the minimax error tothe error probability of a multi-way hypothesis testingproblem. After this step, we apply the Fano inequalityto lower bound this error probability via

P[XB 6= Xβ] ≥ 1− I(y;XB) + log 2

logM(δn;Bq(Rq)),

whereI(y;XB) now represents the mutual information5

5Despite the difference in notation, this mutual information is thesame asI(y;B), since it measures the information between theobservation vectory and the discrete indexi.

between random parameterXB (uniformly distributedover the packing set) and the observation vectory ∈ Rn.

From Lemma 3, the κc ǫ-covering number of theset absconvq(X) is upper bounded (up to a constantfactor) by theǫ covering number ofBq(Rq) in ℓ2-norm,which we denote byN(ǫn;Bq(Rq)). Following the samereasoning as in Theorem2(a), the mutual information isupper bounded as

I(y;XB) ≤ logN(ǫn;Bq(Rq)) +n

2σ2κ2c ǫ

2n.

Combined with the Fano lower bound,P[XB 6= Xβ] islower bounded by

1− logN(ǫn;Bq(Rq)) +nσ2 κ

2c ǫ

2n + log 2

logM(δn;Bq(Rq)). (35)

Lastly, we choose the packing and covering radii (δnand ǫn respectively) such that the lower bound (35)remains bounded below by1/2. As in the proof of Theo-rem1(a), it suffices to choose the pair(ǫn, δn) to satisfythe relations (31a) and (31b). The same choice ofǫnensures that relation (31a) holds; moreover, by makinga sufficiently small choice of the universal constantcin the definition (33) of δn, we may ensure that therelation (31b) also holds. Thus, as long asN2(ǫn) ≥ 2,we are then guaranteed that

P[XB 6= Xβ] ≥ 1− logN(δn;Bq(Rq)) + log 2

4 logN(δn;Bq(Rq))

≥ 1/2,

as desired.b) Proof of Theorem3(b): Recall the asser-

tion of Lemma 4, which guarantees the existence ofa set δ2n

2s H is an δn-packing set in ℓ2-norm withM(δn;Bq(Rq)) = |H| elements, sayβ1, . . . , βM,such that the bounds (32a) and (32b) hold, and suchthat log |H| ≥ s

2 logd−ss/2 . By construction, the difference

vectors(βi−βj) ∈ B0(2s), so that by Assumption3(a),we have

‖X(βi − βj)‖√n

≤ κu‖βi − βj‖2 ≤ κu

√8 δn. (36)

In the reverse direction, since Assumption3(b) holds,we have

‖X(βi − βj)‖2√n

≥ κ0,ℓδn. (37)

We can follow the same steps as in the proof of Theo-rem 1(b), thereby obtaining an upper bound the mutual

15

Page 15: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

information of the formI(y;XB) ≤ 8κ2unδ

2n. Combined

with the Fano lower bound, we have

P[XB 6= Xβ] ≥ 1−8nκ2

u

2σ2 δ2n + log(2)s2n log d−s

s/2

.

Remembering the extra factor ofκℓ from the lowerbound (37), we obtain the lower bound

P

[Mn(B0(s), X) ≥ c′0,q κ2

σ2

κ2u

s logd− s

s/2

]≥ 1

2.

Repeating the argument from the proof of Theorem1(b)allows us to further lower bound this quantity in termsof log(d/s), leading to the claimed form of the bound.

E. Proof of achievability results

We now turn to the proofs of our main achievabilityresults, namely Theorems2 and 4, that provide upperbounds on minimax rates. We prove all parts of thesetheorems by analyzing the family ofM -estimators

β ∈ arg min‖β‖q

q≤Rq

‖y −Xβ‖22. (38)

Note that (38) is a non-convex optimization problemfor q ∈ [0, 1), so it is not an algorithm that wouldbe implemented in practice. Step (1) for upper boundsprovided above implies that if∆ = β − β∗, then

1

n‖X∆‖22 ≤ 2|wTX∆|

n. (39)

The remaining sections are devoted to step (2), whichinvolves controlling |wTX∆|

n for each of the upperbounds.

1) Proof of Theorem2: We begin with upper boundson the minimax rate in squaredℓ2-norm.

a) Proof of Theorem2(a): Recall that this part ofthe theorem deals with the caseq ∈ (0, 1]. We split ouranalysis into two cases, depending on whether the error‖∆‖2 is smaller or larger thanfℓ(Rq, n, d).

Case 1: First, suppose that‖∆‖2 < fℓ(Rq, n, d).Recall that the theorem is based on the assumptionRq

(log dn

)1−q/2< c2. As long as the constantc2 ≪ 1

is sufficiently small (but still independent of the triple(n, d,Rq)), we can assume that

c1Rq

( log dn

)1−q/2 ≤√

Rq

[κ2c

κ2ℓ

σ2

κ2ℓ

log d

n

]1/2−q/4

.

This inequality, combined with the assumptionfℓ(Rq, n, d) ≤ c1Rq

(log dn

)1−q/2imply that the error

‖∆‖2 satisfies the bound (15) for all c ≥ 1.

Case 2: Otherwise, we may assume that‖∆‖2 >fℓ(Rq, n, d). In this case, Assumption2 implies that‖X∆‖2

2

n ≥ κ2ℓ‖∆‖22, and hence, in conjunction with the

inequality (39), we obtain

κ2ℓ‖∆‖22 ≤ 2|wTX∆|/n ≤;

2

n‖wTX‖∞‖∆‖1.

Sincewi ∼ N(0, σ2) and the columns ofX are normal-ized, each entry of2nw

TX is zero-mean Gaussian vectorwith variance at most4σ2κ2

c/n. Therefore, by unionbound and standard Gaussian tail bounds, we obtain thatthe inequality

κ2ℓ‖∆‖22 ≤ 2σκc

√3 log d

n‖∆‖1 (40)

holds with probability greater than1− c1 exp(−c2 log d).

It remains to upper bound theℓ1-norm in terms of theℓ2-norm and a residual term. Since bothβ andβ∗ belongto Bq(Rq), we have‖∆‖qq =

∑dj=1 |∆j |q ≤ 2Rq. We

exploit the following lemma:

Lemma 5. For any vectorθ ∈ Bq(2Rq) and any positivenumberτ > 0, we have

‖θ‖1 ≤√

2Rqτ−q/2‖θ‖2 + 2Rqτ

1−q. (41)

Although this type of result is standard (e.g, [14]), weprovide a proof in AppendixE.

We can exploit Lemma5 by settingτ = 2σκc

κ2ℓ

√3 log d

n ,

thereby obtaining the bound‖∆‖22 ≤ τ‖∆‖1, and hence

‖∆‖22 ≤√

2Rqτ1−q/2‖∆‖2 + 2Rqτ

2−q.

Viewed as a quadratic in the indeterminatex = ‖∆‖2,this inequality is equivalent to the constraintg(x) =ax2 + bx+ c ≤ 0, with a = 1,

b = −√2Rqτ

1−q/2, and c = −2Rqτ2−q.

Sinceg(0) = c < 0 and the positive root ofg(x) occursat x∗ = (−b +

√b2 − 4ac)/(2a), some algebra shows

that we must have

‖∆‖22 ≤ 4maxb2, |c| ≤ 24Rq

[κ2c

κ2ℓ

σ2

κ2ℓ

log d

n

]1−q/2

,

with high probability (stated in Theorem2(a) whichcompletes the proof of Theorem2(a).

b) Proof of Theorem2(b): In order to estab-lish the bound (16), we follow the same steps withfℓ(s, n, d) = 0, thereby obtaining the following simpli-fied form of the bound (40):

‖∆‖22 ≤ κc

κℓ

σ

κℓ

√3 log d

n‖∆‖1.

16

Page 16: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

By definition of the estimator, we have‖∆‖0 ≤ 2s, fromwhich we obtain‖∆‖1 ≤

√2s‖∆‖2. Canceling out a

factor of ‖∆‖2 from both sides yields the claim (16).

Establishing the sharper upper bound (17) requiresmore precise control on the right-hand side of theinequality (39). The following lemma, proved in Ap-pendixF, provides this control:

Lemma 6. Suppose that ‖Xθ‖2√n‖θ‖2

≤ κu for allθ ∈ B0(2s). Then there are universal positive constantsc1, c2 such that for anyr > 0, we have

sup‖θ‖0≤2s,‖θ‖2≤r

1

n

∣∣wTXθ∣∣ ≤ 6 σ r κu

√s log(d/s)

n

(42)

with probability greater than1− c1 exp(−c2 minn, s log(d/s)).

Lemma6 holds for a fixed radiusr, whereas we wouldlike to chooser = ‖∆‖2, which is a random quantity. Toextend Lemma6 so that it also applies uniformly overan interval of radii (and hence also to a random radiuswithin this interval), we use a “peeling” result, statedas Lemma9 in Appendix H. In particular, consider theeventE that there exists someθ ∈ B0(2s) such that

1

n

∣∣wTXθ∣∣ ≥ 6σκu‖θ‖2

√s log(d/s)

n. (43)

Then we claim that

P[E ] ≤ 2 exp(−c s log(d/s))

1− exp(−c s log(d/s))

for some c > 0. This claim follows fromLemma 9 in Appendix H by choosing the functionf(v;X) = 1

n |wTXv|, the set A = B0(2s), thesequencean = n, and the functionsρ(v) = ‖v‖2, and

g(r) = 6σrκu

√s log(d/s)

n . For anyr ≥ σκu

√s log(d/s)

n ,

we are guaranteed thatg(r) ≥ σ2κ2us log(d/s)

n , andµ = σ2κ2

us log(d/s)

n , so that Lemma9 may be applied.We use a similar peeling argument for two of our otherachievability results.

Returning to the main thread, we have

1

n‖X∆‖22 ≤ 6 σ ‖∆‖2 κu

√s log(d/s)

n,

with high probability. By Assumption3(b), we have‖X∆‖22/n ≥ κ2

ℓ‖∆‖22. Canceling out a factor of‖∆‖2and re-arranging yields‖∆‖2 ≤ 12 κuσ

κ2ℓ

√s log(d/s)

n withhigh probability as claimed.

2) Proof of Theorem4: We again make use of theelementary inequality (39) to establish upper bounds onthe prediction error.

a) Proof of Theorem4(a): So as to facilitate track-ing of constants in this part of the proof, we consider therescaled observation model, in whichw ∼ N(0, In) andX : = σ−1X. Note that ifX satisfies Assumption1 withconstantκc, thenX satisfies it with constantκc = κc/σ.Moreover, if we establish a bound on‖X(β− β∗)‖22/n,then multiplying byσ2 recovers a bound on the originalprediction loss.

We first deal with the caseq = 1. In particular, wehave

∣∣ 1nwT Xθ

∣∣ ≤ ‖ wT X

n‖∞‖θ‖1

√3κc

2σ2 log d

n(2R1),

where the second inequality holds with probabil-ity 1− c1 exp(−c2 log d), using standard Gaussian tailbounds. (In particular, since‖Xi‖2/

√n ≤ κc, the variate

wT Xi/n is zero-mean Gaussian with variance at mostκc

2/n.) This completes the proof forq = 1.Turning to the caseq ∈ (0, 1), in order to establish

upper bounds overBq(2Rq), we require the follow-ing analog of Lemma6, proved in AppendixG1. Soas to lighten notation, let us introduce the shorthandh(Rq, n, d) : =

√Rq ( log d

n )12−

q4 .

Lemma 7. For q ∈ (0, 1), suppose that there is auniversal constantc1 such thath(Rq, n, d) < c1 < 1.Then there are universal constantsci, i = 2, . . . , 5 suchthat for any fixed radiusr with r ≥ c2κc

q2 h(Rq, n, d),

we have

supθ∈Bq(2Rq)

‖Xθ‖2√n

≤r

1

n

∣∣wT Xθ∣∣ ≤ c3r κc

q2√

Rq (log d

n)

12−

q4 ,

with probability greater than1− c4 exp(−c5 nh2(Rq, n, d)).

Once again, we require the peeling result (Lemma9from Appendix H) to extend Lemma7 to hold forrandom radii. In this case, we define the eventE as thereexists someθ ∈ Bq(2Rq) such that

1

n

∣∣wT Xθ∣∣ ≥ c3

‖Xθ‖2√n

κc

q2√

Rq (log d

n)

12−

q4 .

By Lemma 9 with the choicesf(v;X) = |wTXv|/n,A = Bq(2Rq), an = n, ρ(v) = ‖Xv‖2√

n, and

17

Page 17: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

g(r) = c3 r κc

q2 h(Rq, n, d), we have

P[E ] ≤ 2 exp(−c n h2(Rq, n, d))

1− exp(−c n h2(Rq, n, d)).

Returning to the main thread, from the basic inequal-ity (39), when the eventE from equation (43) holds, wehave

‖X∆‖22n

≤ ‖X∆‖2√n

√κc

qRq

( log dn

)1−q/2.

Canceling out a factor of‖X∆‖2√n

, squaring both sides,multiplying by σ2 and simplifying yields

‖X∆‖22n

≤ c2 σ2(κc

σ

)qRq

( log dn

)1−q/2

= c2 κ2c Rq

(σ2

κ2c

log d

n

)1−q/2,

as claimed.b) Proof of Theorem4(b): For this part, we require

the following lemma, proven in AppendixG2:

Lemma 8. As long as d2s ≥ 2, then for anyr > 0, we

have

supθ∈B0(2s)‖Xθ‖2√

n≤r

1

n

∣∣wTXθ∣∣ ≤ 9 r σ

√s log(ds )

n

with probability greater than1− exp(− 10s log( d

2s )).

By using a peeling technique (see Lemma9 in Ap-pendix H), we now extend the result to hold uniformlyover all radii. Consider the eventE that there exists someθ ∈ B0(2s) such that

1

n

∣∣wTXθ∣∣ ≥ 9σ

‖Xθ‖2√n

√s log(d/s)

n

.

We now apply Lemma9 with the sequencean = n, thefunction f(v;X) = 1

n |wTXv|, the setA = B0(2s), andthe functions

ρ(v) =‖Xv‖2√

n, andg(r) = 9 r κc

q2

√s log(d/s)

n.

We take r ≥ σκu

√s log(d/s)

n , which implies that

g(r) ≥ σ2κ2us log(d/s)

n , and µ = σ2κ2us log(d/s)

n inLemma9. Consequently, we are guaranteed that

P[E ] ≤ 2 exp(−c s log(d/s))

1− exp(−c s log(d/s)).

Combining this tail bound with the basic inequality (39),we conclude that

‖X∆‖22n

≤ 9‖X∆‖2√

√s log(ds )

n,

with high probability, from which the result follows.

IV. CONCLUSION

The main contribution of this paper is to obtainoptimal minimax rates of convergence for the linearmodel (1) under high-dimensional scaling, in which thesample sizen and problem dimensiond are allowedto scale, for general design matricesX. We providedmatching upper and lower bounds for theℓ2-norm andℓ2-prediction loss, so that the optimal minimax ratesare determined in these cases. To our knowledge, thisis the first paper to present minimax optimal rates inℓ2-prediction error for general design matricesX andgeneral q ∈ [0, 1]. We also derive optimal minimaxrates inℓ2-error, with similar rates obtained in concurrentwork by Zhang [42] under different conditions onX.

Apart from the rates themselves, our analysis high-lights how conditions on the design matrixX enterin complementary manners for theℓ2-norm and ℓ2-prediction loss functions. On one hand, it is possible toobtain lower bounds onℓ2-norm error (see Theorem1)or upper bounds onℓ2-prediction error (see Theorem4)under very mild assumptions onX—in particular, ouranalysis requires only that the columns ofX/

√n have

boundedℓ2-norms (see Assumption1). On the otherhand, in order to obtain upper bounds onℓ2-norm error(Theorem 2) or lower bound onℓ2-norm predictionerror (Theorem3), the design matrixX must satisfy,in addition to column normalization, other more restric-tive conditions. Indeed both lower bounds in predictionerror and upper bounds inℓ2-norm error require thatelements ofBq(Rq) are well separated in predictionsemi-norm‖X(·)‖2/

√n. In particular, our analysis was

based on imposed on a certain type of restricted lowereigenvalue condition onXTX measured over theℓq-ball, as formalized in Assumption2. As shown by ourresults, this lower bound is intimately related to thedegree of non-identifiability over theℓq-ball of the high-dimensional linear regression model. Finally, we notethat similar techniques can be used to obtain minimax-optimal rates for more general problems of sparse non-parametric regression [28].

APPENDIX

A. Proof of Proposition1

Under the stated conditions, Theorem 1 from Raskuttiet al. [29] guarantees that

‖Xθ‖2√n

≥ λmin(Σ1/2)

4‖θ‖2 − 9

(ρ2(Σ) log dn

)1/2 ‖θ‖1(44)

18

Page 18: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

for all θ ∈ Rd, with probability greater than1− c1 exp(−c2n). When θ ∈ Bq(2Rq), Lemma 5guarantees that

‖θ‖1 ≤√

2Rqτ−q/2‖θ‖2 + 2Rqτ

1−q

for all τ > 0. We now setτ =√

log dn and substitute

the result into the lower bound (44). Following somealgebra, we find that‖Xθ‖2√

nis lower bounded by

λmin(Σ1/2)

4− 18ρ(Σ)

√Rq

( log dn

)1/2−q/4‖θ‖2

− 18 Rqρ(Σ)( log d

n

)1−q/2.

Consequently, as long as

λmin(Σ1/2)

8> 18ρ(Σ)

√Rq

( log dn

)1/2−q/4,

then we are guaranteed that

‖Xθ‖2√n

≥ λmin(Σ1/2)

4‖θ‖2 − 18 Rqρ(Σ)

( log dn

)1−q/2

for all θ ∈ Bq(2Rq) as claimed.

B. Proof of Lemma2

The result is obtained by inverting known results on(dyadic) entropy numbers ofℓq-balls; there are someminor technical subtleties in performing the inversion.For a d-dimensional ℓq ball with q ∈ (0, 1], it isknown [17, 21, 31] that for all integersk ∈ [log d, d],the dyadic entropy numbersǫk of the ball Bq(1) withrespect to theℓ2-norm scale as

ǫk(Bq(1)) = Cq

( log(1 + dk )

k

)1/q−1/2. (45)

Moreover, fork ∈ [1, log d], we haveǫk(Bq(1)) ≤ Cq.We first establish the upper bound on the metric

entropy. Sinced ≥ 2, we have

ǫk(Bq(1)) ≤ Cq

( log(1 + d2 )

k

)1/q−1/2

≤ Cq

( log dk

)1/q−1/2.

Inverting this inequality fork = logN(ǫ;Bq(1)) andallowing for a ball radiusRq yields

logN(ǫ;Bq(Rq)) ≤(Cq

Rq1/q

ǫ

) 2q2−q log d, (46)

as claimed. The conditions ǫ ≤ Rq1/q and

ǫ ≥ CqRq1/q

(log dd

) 2−q2q ensure thatk ∈ [log d, d].

We now turn to proving the lower bound on the metricentropy, for which we require the existence of some fixed

ν ∈ (0, 1) such thatk ≤ d1−ν . Under this assumption,we have1 + d

k ≥ dk ≥ dν , and hence

Cq

( log(1 + dk )

k

)1/q−1/2 ≥ Cq

(ν log dk

)1/q−1/2.

Accounting for the radiusRq (as was done for the upperbound) yields

logN(ǫ;Bq(Rq)) ≥ ν(CqRq

1/q

ǫ

) 2q2−q log d,

as claimed.Finally, let us check that our assumptions onk needed

to perform the inversion are ensured by the conditionsthat we have imposed onǫ. The conditionk ≥ log dis ensured by settingǫ < 1. Turning to the conditionk ≤ d1−ν , from the bound (46) on k, it suffices

to chooseǫ such that(CqRq

1/q

ǫ

) 2q2−q log d ≤ d1−ν .

This condition is ensured by enforcing the lower bound

ǫ2 = Ω(Rq

2/(2−q) log dd1−ν

) 2−qq for someν ∈ (0, 1).

C. Proof of Lemma3

We deal first with (dyadic) entropy numbers,as previously defined (25), and show thatǫ2k−1(absconvq(X/

√n), ‖ · ‖2) is upper bounded

by

c κc min

1,( log(1 + d

k )

k

) 1q− 1

2

. (47)

We prove this intermediate claim by combininga number of known results on the behavior ofdyadic entropy numbers. First, using Corollary 9 fromGuedon and Litvak [17], for all k = 1, 2, . . .,ǫ2k−1(absconvq(X/

√n), ‖ · ‖2) is upper bounded as

follows:

c ǫk(absconv1(X/√n), ‖ · ‖2) min

1,( log(1 + d

k )

k

) 1q−1

.

Using Corollary 2.4 from Carl and Pajor [9],ǫk(absconv1(X/

√n), ‖ · ‖2) is upper bounded as fol-

lows:

c√n|||X|||1→2 min

1,( log(1 + d

k )

k

)1/2,

where |||X|||1→2 denotes the norm ofX viewed as anoperator fromℓd1 → ℓn2 . More specifically, we have

1√n|||X|||1→2 =

1√n

sup‖u‖1=1

‖Xu‖2

=1√n

sup‖v‖2=1

sup‖u‖1=1

vTXu

= maxi=1,...,d

‖Xi‖2/√n ≤ κc.

19

Page 19: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

Overall, we have shown thatǫ2k−1(absconvq(X√n), ‖·‖2)

is upper bounded by

c κc min

1,( log(1 + d

k )

k

) 1q− 1

2

,

as claimed. Finally, under the stated assumptions, wemay invert the upper bound (47) by the same procedureas in the proof of Lemma2 (see AppendixB), therebyobtaining the claim.

D. Proof of Lemma4

Our proof is inspired by related results from theapproximation theory literature (see, e.g., Kuhn [21]and Birge and Massart [6]). For each even integers = 2, 4, 6, . . . , d, let us define the set

H : =z ∈ −1, 0,+1d | ‖z‖0 = s

. (48)

Note that the cardinality of this set is|H| =(ds

)2s, and

moreover, we have‖z−z′‖0 ≤ 2s for all pairsz, z′ ∈ H.We now define the Hamming distanceρH onH×H viaρH(z, z′) =

∑dj=1 I[zj 6= z′j ]. For some fixed element

z ∈ H, consider the setz′ ∈ H | ρH(z, z′) ≤ s/2.Note that its cardinality is upper bounded as

∣∣z′ ∈ H | ρH(z, z′) ≤ s/2∣∣ ≤

(d

s/2

)3s/2.

To see this, note that we simply choose a subset of sizes/2 wherez andz′ agree and then choose the others/2co-ordinates arbitrarily.

Now consider a setA ⊂ H with cardinality at most

|A| ≤ m : =(ds)( ds/2)

. The set of elementsz ∈ H that are

within Hamming distances/2 of some element ofA hascardinality at most

|z ∈ H | : ρH(z, z′) ≤ s/2 for somez′ ∈ A|

≤ |A|(

d

s/2

)3s/2 < |H|,

where the final inequality holds sincem(

ds/2

)3s/2 < |H|. Consequently, for any such

set with cardinality|A| ≤ m, there exists az ∈ H suchthat ρH(z, z′) > s/2 for all z′ ∈ A. By inductivelyadding this element at each round, we then create a setwith A ⊂ H with |A| > m such thatρH(z, z′) > s/2for all z, z′ ∈ A.

To conclude, let us lower bound the cardinalitym. Wehave

m =

(ds

)(

ds/2

) =(d− s/2)! (s/2)!

(d− s)! s!

=

s/2∏

j=1

d− s+ j

s/2 + j≥

(d− s/2

s

)s/2,

where the final inequality uses the fact that the ratiod−s+js/2+j is decreasing as a function ofj (see Kuhn [21]pp. 122–123 and Birge and Massart [6], Lemma 4 fordetails).

E. Proof of Lemma5

Defining the setS = j | |θj | > τ, we have

‖θ‖1 = ‖θS‖1 +∑

j /∈S

|θj | ≤√|S|‖θ‖2 + τ

j /∈S

|θj |τ

.

Since|θj |/τ ≤ 1 for all i /∈ S, we obtain

‖θ‖1 ≤√|S|‖θ‖2 + τ

j /∈S

(|θi|/τ

)q

≤√|S|‖θ‖2 + 2Rqτ

1−q.

Finally, we observe2Rq ≥ ∑j∈S |θj |q ≥ |S|τ q, from

which the result follows.

F. Proof of Lemma6

For a given radiusr > 0, define the set

S(s, r) : =θ ∈ Rd | ‖θ‖0 ≤ 2s, ‖θ‖2 ≤ r

,

and the random variablesZn = Zn(s, r) given by

Zn(s, r) : = supθ∈S(s,r)

1

n|wTXθ|.

For a givenǫ ∈ (0, 1) to be chosen, let us upper boundthe minimal cardinality of a set that coversS(s, r) upto (rǫ)-accuracy inℓ2-norm. We claim that we mayfind such a covering setθ1, . . . , θN ⊂ S(s, r) withcardinalityN = N(ǫ, S(s, r)) that is upper bounded as

logN(ǫ; S(s, r)) ≤ log

(d

2s

)+ 2s log(1/ǫ).

To establish this claim, note that here are(d2s

)subsets of

size2s within 1, 2, . . . , d. Moreover, for any2s-sizedsubset, there is an(rǫ)-covering inℓ2-norm of the ballB2(r) with at most22s log(1/ǫ) elements (e.g., [23]).

Consequently, for eachθ ∈ S(s, r), we may find someθk such that‖θ − θk‖2 ≤ rǫ. By triangle inequality, wethen have

1

n|wTXθ| ≤ 1

n|wTXθk|+ 1

n|wTX(θ − θi)|

≤ 1

n|wTXθk|+ ‖w‖2√

n

‖X(θ − θk)‖2√n

.

Given the assumptions onX, we have ‖X(θ −θk)‖2/

√n ≤ κu‖θ − θk‖2 ≤ κu rǫ. Moreover, since

the variate‖w‖22/σ2 is χ2 with n degrees of freedom,we have‖w‖2√

n≤ 2σ with probability 1− c1 exp(−c2n),

20

Page 20: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

using standard tail bounds (see AppendixI). Puttingtogether the pieces, we conclude that

1

n|wTXθ| ≤ 1

n|wTXθk|+ 2κu σ r ǫ

with high probability. Taking the supremum overθ onboth sides yields

Zn ≤ maxk=1,2,...,N

1

n|wTXθk|+ 2κu σ r ǫ.

It remains to bound the finite maximum over thecovering set. We begin by observing that each vari-ate wTXθk/n is zero-mean Gaussian with varianceσ2‖Xθi‖22/n2. Under the given conditions onθk andX,this variance is at mostσ2κ2

ur2/n, so that by standard

Gaussian tail bounds, we conclude that

Zn ≤ σ r κu

√3 logN(ǫ; S(s, r))

n+ 2κu σr ǫ

= σ r κu

√3 logN(ǫ; S(s, r))

n+ 2ǫ

. (49)

with probability greater than1− c1 exp(−c2 logN(ǫ; S(s, r))).

Finally, suppose thatǫ =√

s log(d/2s)n . With this

choice and recalling thatn ≤ d by assumption, we obtain

logN(ǫ; S(s, r))

n≤ log

(d2s

)

n+

s log ns log(d/2s)

n

≤ log(d2s

)

n+

s log(d/s)

n

≤ 2s+ 2s log(d/s)

n+

s log(d/s)

n,

where the final line uses standard bounds on bino-mial coefficients. Sinced/s ≥ 2 by assumption,we conclude that our choice ofǫ guarantees thatlogN(ǫ;S(s,r))

n ≤ 5 s log(d/s). Substituting these rela-tions into the inequality (49), we conclude that

Zn ≤ σ r κu

4

√s log(d/s)

n+ 2

√s log(d/s)

n

,

as claimed. SincelogN(ǫ; S(s, r)) ≥ cs log(d/s), thisevent occurs with probability at least

1− c1 exp(−c2 minn, s log(d/s)),

as claimed.

G. Proofs for Theorem4

This appendix is devoted to the proofs of technicallemmas used in Theorem4.

1) Proof of Lemma7: For q ∈ (0, 1), let us definethe set

Sq(Rq, r) : = Bq(2Rq) ∩θ ∈ Rd | ‖Xθ‖2/

√n ≤ r

.

We seek to bound the random variableZ(Rq, r) : = supθ∈Sq(Rq,r)

1n |wT Xθ|, which we do

by a chaining result—in particular, Lemma 3.2 in vande Geer [33]). Adopting the notation from this lemma,we seek to apply it withǫ = δ/2, andK = 4. Supposethat ‖Xθ‖2√

n≤ r, and

√nδ ≥ c1r (50a)

√nδ ≥ c1

∫ r

δ16

√logN(t; Sq)dt = : J(r, δ). (50b)

whereN(t; Sq; ‖.‖2/√n) is the covering number forSq

in the ℓ2-prediction norm (defined by‖Xθ‖/√n). Aslong as ‖w‖2

2

n ≤ 16, Lemma 3.2 guarantees that

P[Z(Rq, r) ≥ δ,

‖w‖22n

≤ 16] ≤ c1 exp (−c2nδ2

r2).

By tail bounds onχ2 random variables (see AppendixI),we haveP[‖w‖22 ≥ 16n] ≤ c4 exp(−c5n). Consequently,we conclude that

P[Z(Rq, r) ≥ δ] ≤ c1 exp (−c2

nδ2

r2) + c4 exp(−c5n)

For somec3 > 0, let us set

δ = c3 r κc

q2√

Rq (log d

n)

12−

q4 ,

and let us verify that the conditions (50a) and (50b) hold.Given our choice ofδ, we find that

δ

r

√n = Ω(nq/4(log d)1/2−q/4).

By the condition (11), the dimensiond is lower boundedso that condition (50a) holds. Turning to verification ofthe inequality (50b), we first provide an upper boundfor logN(Sq, t). Settingγ = Xθ√

nand from the defini-

tion (28) of absconvq(X/√n), we have

supθ∈Sq(Rq,r)

1

n|wT Xθ| ≤ sup

γ∈absconvq(X/√n)

‖γ‖2≤r

1√n|wT γ|.

We may apply the bound in Lemma3 toconclude that logN(ǫ; Sq) is upper bounded by

c′ Rq2

2−q(κc

ǫ

) 2q2−q log d.

Using this upper bound, we have

J(r, δ) : =

∫ r

δ/16

√logN(Sq, t)dt

≤ c Rq1

2−q κc

q2−q

√log d

∫ r

δ/16

t−q/(2−q)dt

= c′Rq1

2−q κc

q2−q

√log d r1−

q2−q .

21

Page 21: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

Using the definition ofδ, the condition (11), and thecondition r = Ω(κc

q2√

Rq ( log dn )

12−

q4 ), it is straight-

forward to verify that(δ/16, r) lies in the range ofǫspecified in Lemma3.

Using this upper bound, let us verify that the inequal-ity (50b) holds as long asr = Ω(κc

q2√

Rq ( log dn )

12−

q4 ),

as assumed in the statement of Lemma7. With ourchoice ofδ, we have

J√n δ

≤c′Rq

12−q κc

q2−q

√log dn r1−

q2−q

c3 r κc

q2√

Rq ( log dn )

12−

q4

=c′(Rqκc

q( log dn

) q2 )

12−q− 1

2−q

2 (2−q)

c3

=c′

c3,

so that condition (50b) will hold as long as we choosec3 > 0 large enough. Overall, we conclude that

P[Z(Rq, r) ≥ c3 r κc

q2√

Rq (log d

n)

12−

q4 ]

is at mostc1 exp(−Rq(log d)1− q

2nq2 ), which concludes

the proof.2) Proof of Lemma8: First, consider a fixed subset

S ⊂ 1, 2, . . . , d of cardinality |S| = s. Applyingthe SVD to the sub-matrixXS ∈ Rn×s, we haveXS = V DU , where V ∈ Rn×s has orthonormalcolumns, andDU ∈ Rs×s. By construction, for any∆S ∈ Rs, we have‖XS∆S‖2 = ‖DU∆S‖2. SinceVhas orthonormal columns, the vectorwS = V Tw ∈ Rs

has i.i.d.N(0, σ2) entries. Consequently, for any∆S

such that‖XS∆S‖2√n

≤ r, we have

∣∣∣wTXS∆S

n

∣∣∣ =∣∣∣ w

TS√n

DU∆S√n

∣∣∣

≤ ‖wS‖2√n

‖DU∆S‖2√n

≤ ‖wS‖2√n

r.

Now the variateσ−2‖wS‖22 is χ2 with s degrees of free-dom, so that by standardχ2 tail bounds (see AppendixI),we have

P[‖wS‖22

σ2s≥ 1 + 4δ

]≤ exp(−sδ), valid for all δ ≥ 1.

Settingδ = 20 log( d2s ) and noting thatlog( d

2s ) ≥ log 2by assumption, we have (after some algebra)

P

[‖wS‖22n

≥ σ2s

n

(81 log(d/s)

)]≤ exp(−20s log(

d

2s)).

We have thus shown that for each fixed subset, we havethe bound

∣∣∣wTXS∆S

n

∣∣∣ ≤ r

√81σ2s log( d

2s )

n,

with probability at least1− exp(−20s log( d2s )).

Since there are(d2s

)≤

(de2s )

2s subsets of sizes,applying a union bound yields that

P

[sup

θ∈B0(2s),‖Xθ‖2√

n≤r

|wTXθ

n| ≥ r

√81σ2s log( d

2s )

n

]

≤ exp(− 20s log(

d

2s) + 2s log

de

2s

)

≤ exp(− 10s log(

d

2s)),

as claimed.

H. Peeling argument

In this appendix, we state a tail bound for the con-strained optimum of a random objective function ofthe form f(v;X), where v ∈ Rd is the vector to beoptimized over, andX is some random vector. Moreprecisely, consider the problemsupρ(v)≤r, v∈A f(v;X),whereρ : Rd → R+ is some non-negative and increasingconstraint function, andA is a non-empty set. Our goalis to bound the probability of the event defined by

E : =X ∈ Rn×d | ∃ v ∈ A s. t. f(v;X) ≥ 2g(ρ(v))

,

whereg : R → R is non-negative and strictly increasing.

Lemma 9. Suppose thatg(r) ≥ µ for all r ≥ 0, andthat there exists some constantc > 0 such that for allr > 0, we have the tail bound

P[

supv∈A, ρ(v)≤r

f(v;X) ≥ g(r)] ≤ 2 exp(−c an g(r)),

(51)

for somean > 0. Then we have

P[E ] ≤ 2 exp(−4canµ)

1− exp(−4canµ).

Proof: Our proof is based on a standard peelingtechnique (e.g., [2], [33], p. 82). By assumption, asvvaries overA, we haveg(r) ∈ [µ,∞). Accordingly, form = 1, 2, . . ., defining the sets

Am : =v ∈ A | 2m−1µ ≤ g(ρ(v)) ≤ 2mµ

,

22

Page 22: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

we may conclude that if there existsv ∈ A such thatf(v,X) ≥ 2g(ρ(v)), then this must occur for somemandv ∈ Am. By union bound, we have

P[E ] ≤∞∑

m=1

P[∃ v ∈ Am such thatf(v,X) ≥ 2g(ρ(v))

].

If v ∈ Am andf(v,X) ≥ 2g(ρ(v)), then by definition ofAm, we havef(v,X) ≥ 2 (2m−1)µ = 2mµ. Since forany v ∈ Am, we haveg(ρ(v)) ≤ 2mµ, we combinethese inequalities to obtain

P[E ] ≤∞∑

m=1

P[

supρ(v)≤g−1(2mµ)

f(v,X) ≥ 2mµ]

≤∞∑

m=1

2 exp(− can [g(g−1(2mµ))]

)

= 2∞∑

m=1

exp(− can 2mµ

),

from which the stated claim follows by upper boundingthis geometric sum.

I. Some tail bounds forχ2-variates

The following large-deviations bounds for centralizedχ2 are taken from Laurent and Massart [22]. Given acentralizedχ2-variate Z with m degrees of freedom,then for allx ≥ 0,

P[Z −m ≥ 2

√mx+ 2x

]≤ exp(−x),and (52a)

P[Z −m ≤ −2

√mx

]≤ exp(−x). (52b)

The following consequence of this bound is useful: fort ≥ 1, we have

P[Z −m

m≥ 4t

]≤ exp(−mt). (53)

Starting with the bound (52a), settingx = tm yieldsP[Z−mm ≥ 2

√t+2t

]≤ exp(−tm), Since4t ≥ 2

√t+2t

for t ≥ 1, we haveP[Z−mm ≥ 4t] ≤ exp(−tm) for all

t ≥ 1.

REFERENCES

[1] M. Akcakaya and V. Tarokh. Shannon theoreticlimits on noisy compressive sampling.IEEE Trans.Info Theory, 56(1):492–504, January 2010.

[2] K. S. Alexander. Rates of growth for weightedempirical processes. InProceedings of the BerkeleyConference in Honor of Jerzy Neyman and JackKiefer, pages 475—493. UC Press, Berkeley, 1985.

[3] A. R. Barron, L. Birge, and P. Massart. Risk boundsfor model selection by penalization.ProbabilityTheory and Related Fields, 113:301–413, 1999.

[4] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneousanalysis of Lasso and Dantzig selector.Annals ofStatistics, 37(4):1705–1732, 2009.

[5] L. Birge. Approximation dans les espacesmetriques et theorie de l’estimation.Z. Wahrsch.verw. Gebiete, 65:181–327, 1983.

[6] L. Birge and P. Massart. Gaussian model selection.J. Eur. Math. Soc., 3:203–268, 2001.

[7] F. Bunea, A. Tsybakov, and M. Wegkamp. Aggre-gation for Gaussian regression.Annals of Statistics,35(4):1674–1697, 2007.

[8] E. Candes and T. Tao. The Dantzig selector:Statistical estimation whenp is much larger thann. Annals of Statistics, 35(6):2313–2351, 2007.

[9] B. Carl and A. Pajor. Gelfand numbers of operatorswith values in a Hilbert space.Invent. math., 94:479–504, 1988.

[10] B. Carl and I. Stephani.Entropy, compactness andthe approximation of operators. Cambridge Tractsin Mathematics. Cambridge University Press, Cam-bridge, UK, 1990.

[11] S. Chen, D. L. Donoho, and M. A. Saunders.Atomic decomposition by basis pursuit.SIAM J.Sci. Computing, 20(1):33–61, 1998.

[12] A. Cohen, W. Dahmen, and R. DeVore. Com-pressed sensing and best k-term approximation.J.of. American Mathematical Society, 22(1):211–231,January 2009.

[13] T.M. Cover and J.A. Thomas.Elements of Infor-mation Theory. John Wiley and Sons, New York,1991.

[14] D. L. Donoho and I. M. Johnstone. Minimax riskoverℓp-balls forℓq-error.Prob. Theory and RelatedFields, 99:277–303, 1994.

[15] A. K. Fletcher, S. Rangan, and V. K. Goyal.Necessary and sufficient conditions for sparsitypattern recovery.IEEE Transactions on InformationTheory, 55(12):5758–5772, 2009.

[16] E. Greenshtein and Y. Ritov. Persistency in high di-mensional linear predictor-selection and the virtueof over-parametrization. Bernoulli, 10:971–988,2004.

[17] O. Guedon and A. E. Litvak. Euclidean projectionsof a p-convex body. InGeometric aspects offunctional analysis, pages 95–108. Springer-Verlag,2000.

[18] T. S. Han and S. Verdu. Generalizing the Fanoinequality. IEEE Transactions on Information The-ory, 40:1247–1251, 1994.

[19] R. Z. Has’minskii. A lower bound on the risks ofnonparametric estimates of densities in the uniform

23

Page 23: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

metric. Theory Prob. Appl., 23:794–798, 1978.[20] I. A. Ibragimov and R. Z. Has’minskii.Statistical

Estimation: Asymptotic Theory. Springer-Verlag,New York, 1981.

[21] T. Kuhn. A lower estimate for entropy numbers.Journal of Approximation Theory, 110:120–124,2001.

[22] B. Laurent and P. Massart. Adaptive estimation ofa quadratic functional by model selection.Annalsof Statistics, 28(5):1303–1338, 1998.

[23] J. Matousek. Lectures on discrete geometry.Springer-Verlag, New York, 2002.

[24] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with theLasso.Annals of Statistics, 34(3):1436–1462, 2006.

[25] N. Meinshausen and B.Yu. Lasso-type recoveryof sparse representations for high-dimensional data.Annals of Statistics, 37(1):246–270, 2009.

[26] S. Negahban, P. Ravikumar, M. Wainwright, andB. Yu. A unified framework for high-dimensionalanalysis of M-estimators with decomposable regu-larizers. InProceedings of NIPS, 2009. Full lengthversion: arxiv.org/abs/1010.2731v1.

[27] D. Pollard. Convergence of Stochastic Processes.Springer-Verlag, New York, 1984.

[28] G. Raskutti, M. J. Wainwright, and B. Yu.Minimax-optimal rates for sparse additive modelsover kernel classes via convex programming. Tech-nical Report http://arxiv.org/abs/1008.3654, UCBerkeley, Department of Statistics, August 2010.

[29] G. Raskutti, M. J. Wainwright, and B. Yu. Re-stricted eigenvalue properties for correlated gaus-sian designs. Journal of Machine Learning Re-search, 11:2241–2259, August 2010.

[30] P. Rigollet and A. B. Tsybakov. Exponentialscreening and optimal rates of sparse estimation.Technical report, Princeton, 2010.

[31] C. Schutt. Entropy numbers of diagonal operatorsbetween symmetric Banach spaces.Journal ofApproximation Theory, 40:121–128, 1984.

[32] R. Tibshirani. Regression shrinkage and selectionvia the Lasso. Journal of the Royal StatisticalSociety, Series B, 58(1):267–288, 1996.

[33] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

[34] S. van de Geer. The deterministic Lasso. InProc.of Joint Statistical Meeting, 2007.

[35] S. van de Geer and J.-M. Loubes. Adaptiveestimation in regression, using soft thresholdingtype penalties.Statistica Neerlandica, 56:453–478,2002.

[36] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery usingℓ1-constrained quadratic programming (Lasso).IEEETrans. Information Theory, 55:2183–2202, May2009.

[37] M. J. Wainwright. Information-theoretic bounds forsparsity recovery in the high-dimensional and noisysetting. IEEE Trans. Information Theory, 55(12):5728–5741, 2009.

[38] W. Wang, M. J. Wainwright, and K. Ramchandran.Information-theoretic limits on sparse signal recov-ery: Dense versus sparse measurement matrices.IEEE Trans. Info Theory, 56(6):2967–2979, June2010.

[39] Y. Yang and A. Barron. Information-theoreticdetermination of minimax rates of convergence.Annals of Statistics, 27(5):1564–1599, 1999.

[40] B. Yu. Assouad, Fano and Le Cam.ResearchPapers in Probability and Statistics: Festschrift inHonor of Lucien Le Cam, pages 423–435, 1996.

[41] C. H. Zhang and J. Huang. The sparsity and biasof the Lasso selection in high-dimensional linearregression.Annals of Statistics, 36(4):1567–1594,2008.

[42] C.H. Zhang. Nearly unbiased variable selection un-der minimax concave penalty.Annals of Statistics,2010. To appear.

[43] P. Zhao and B. Yu. On model selection consistencyof Lasso. Journal of Machine Learning Research,7:2541–2567, 2006.

Biographies

• Garvesh Raskutti is currently a PhD candidate inthe Department of Statistics at the University ofCalifornia, Berkeley. He received his Bachelor ofScience (majors in Mathematics and Statistics) andBachelor of Engineering (major in Electrical En-gineering), and Masters of Engineering Science atthe University of Melbourne. His research interestsinclude statistical machine learning, mathematicalstatistics, convex optimization, information theory,and high-dimensional statistics. He was awarded aBerkeley Graduate Fellowship from 2007-09.

• Martin Wainwright is currently a professor at Uni-versity of California at Berkeley, with a joint ap-pointment between the Department of Statisticsand the Department of Electrical Engineering andComputer Sciences. He received a Bachelor’s de-gree in Mathematics from University of Waterloo,Canada, and Ph.D. degree in Electrical Engineeringand Computer Science (EECS) from Massachusetts

24

Page 24: Minimax rates of estimation for ℓ -ballswainwrig/Papers/... · The area of high-dimensional statistical inference con-cerns the estimation in the d > n regime, where d refers to

Institute of Technology (MIT). His research inter-ests include statistical signal processing, coding andinformation theory, statistical machine learning, andhigh-dimensional statistics. He has been awardedan Alfred P. Sloan Foundation Fellowship, an NSFCAREER Award, the George M. Sprowls Prize forhis dissertation research (EECS department, MIT),a Natural Sciences and Engineering Research Coun-cil of Canada 1967 Fellowship, an IEEE SignalProcessing Society Best Paper Award in 2008, andseveral outstanding conference paper awards.

• Bin Yu is Chancellor’s Professor in the Depart-ments of Statistics and Electrical Engineering andComputer Science at UC Berkeley. She is currentlythe chair of department of Statistics, and a found-ing co-director of the Microsoft Lab on Statisticsand Information Technology at Peking University,China. Her Ph.D. is in Statistics from UC Berke-ley, and she has published over 80 papers in awide range of research areas including empiricalprocess theory, information theory (MDL), MCMCmethods, signal processing, machine learning, highdimensional data inference (boosting and Lassoand sparse modeling in general), bioinformatics,and remotes sensing. Her current research interestsinclude statistical machine learning for high dimen-sional data and solving data problems from remotesensing, neuroscience, and text documents.She was a 2006 Guggenheim Fellow, and is aFellow of AAAS, IEEE, IMS (Institute of Math-ematical Statistics) and ASA (American StatisticalAssociation). She is a co-chair of National ScientificCommittee of SAMSI and on the Board of Mathe-matical Sciences and Applications of the NationalAcademy of Sciences in the US. She has beeninvited to deliver the 2012 Tukey Memorial Lecturein Statistics at the 8th World Congress in Probabilityand Statistics, the quadrennial joint conference ofthe Bernoulli Society and IMS in Istanbul, Turkey.

25