

High Dimensional Statistics

Lecture Notes

(This version: November 5, 2019)

Philippe Rigollet and Jan-Christian Hütter

Spring 2017


Preface

These lecture notes were written for the course 18.657, High Dimensional Statistics, at MIT. They build on a set of notes prepared at Princeton University in 2013-14 that has been modified (and hopefully improved) over the years.

Over the past decade, statistics has undergone drastic changes with the development of high-dimensional statistical inference. Indeed, on each individual, more and more features are measured, to the point that their number usually far exceeds the number of observations. This is the case in biology, and specifically genetics, where millions of (combinations of) genes are measured for a single individual. High resolution imaging, finance, online advertising, climate studies . . . the list of data-intensive fields is too long to be established exhaustively. Clearly not all measured features are relevant for a given task and most of them are simply noise. But which ones? What can be done with so little data and so much noise? Surprisingly, the situation is not that bad, and on some simple models we can assess to what extent meaningful statistical methods can be applied. Regression is one such simple model.

Regression analysis can be traced back to 1632 when Galileo Galilei used a procedure to infer a linear relationship from noisy data. It was not until the early 19th century that Gauss and Legendre developed a systematic procedure: the least-squares method. Since then, regression has been studied in so many forms that much insight has been gained, and recent advances in high-dimensional statistics would not have been possible without standing on the shoulders of giants. In these notes, we will explore one, obviously subjective, giant on whose shoulders high-dimensional statistics stands: nonparametric statistics.

The works of Ibragimov and Has'minskii in the seventies, followed by many researchers from the Russian school, have contributed to developing a large toolkit to understand regression with an infinite number of parameters. Much insight can be gained from this work to understand high-dimensional or sparse regression, and it comes as no surprise that Donoho and Johnstone made the first contributions on this topic in the early nineties.

Therefore, while not obviously connected to high dimensional statistics, we will talk about nonparametric estimation. I borrowed this disclaimer (and the template) from my colleague Ramon van Handel. It does apply here.

I have no illusions about the state of these notes—they were written rather quickly, sometimes at the rate of a chapter a week. I have no doubt that many errors remain in the text; at the very least many of the proofs are extremely compact, and should be made a little clearer as is befitting of a pedagogical (?) treatment. If I have another opportunity to teach such a course, I will go over the notes again in detail and attempt the necessary modifications. For the time being, however, the notes are available as-is.

Like any good set of notes, they should be perpetually improved and updated, but a two or three year horizon is more realistic. Therefore, if you have any comments, questions, or suggestions, or if you find omissions and of course mistakes, please let me know. I can be contacted by e-mail at [email protected].

Acknowledgements. These notes were improved thanks to the careful reading and comments of Mark Cerenzia, Youssef El Moujahid, Georgina Hall, Gautam Kamath, Hengrui Luo, Kevin Lin, Ali Makhdoumi, Yaroslav Mukhin, Mehtaab Sawhney, Ludwig Schmidt, Bastian Schroeter, Vira Semenova, Mathias Vetter, Yuyan Wang, Jonathan Weed, Chiyuan Zhang and Jianhao Zhang.

These notes were written under the partial support of the National Science Foundation, CAREER award DMS-1053987.

Required background. I assume that the reader has had basic courses in probability and mathematical statistics. Some elementary background in analysis and measure theory is helpful but not required. Some basic notions of linear algebra, especially the spectral decomposition of matrices, are required for the later chapters.

Since the first version of these notes was posted, a couple of manuscripts on high-dimensional probability by Ramon van Handel [vH17] and Roman Vershynin [Ver18] have been published. Both are of outstanding quality—much higher than the present notes—and very related to this material. I strongly recommend that the reader learn about this fascinating topic in parallel with high-dimensional statistics.


Notation

Functions, sets, vectors

[n]        Set of integers [n] = {1, . . . , n}
Sd−1       Unit sphere in dimension d
1I(·)      Indicator function
|x|q       ℓq norm of x, defined by |x|q = (Σi |xi|^q)^{1/q} for q > 0
|x|0       ℓ0 norm of x, defined to be the number of nonzero coordinates of x
f(k)       k-th derivative of f
ej         j-th vector of the canonical basis
Ac         Complement of set A
conv(S)    Convex hull of set S
an ≲ bn    an ≤ C bn for a numerical constant C > 0
Sn         Symmetric group on n elements

Matrices

Ip Identity matrix of IRp

Tr(A) trace of a square matrix A

Sd Symmetric matrices in IRd×d

S+d Symmetric positive semi-definite matrices in IRd×d

S++d Symmetric positive definite matrices in IRd×d

A ⪯ B      Order relation given by B − A ∈ S+d

A ≺ B      Order relation given by B − A ∈ S++d

M† Moore-Penrose pseudoinverse of M

∇xf(x) Gradient of f at x

∇xf(x)|x=x0 Gradient of f at x0

Distributions


N (µ, σ2) Univariate Gaussian distribution with mean µ ∈ IR and variance σ2 > 0

Nd(µ,Σ) d-variate Gaussian distribution with mean µ ∈ IRd and covariance matrix Σ ∈ IRd×d

subG(σ2) Univariate sub-Gaussian distributions with variance proxy σ2 > 0

subGd(σ2) d-variate sub-Gaussian distributions with variance proxy σ2 > 0

subE(σ2) sub-Exponential distributions with variance proxy σ2 > 0

Ber(p) Bernoulli distribution with parameter p ∈ [0, 1]

Bin(n, p) Binomial distribution with parameters n ≥ 1, p ∈ [0, 1]

Lap(λ) Double exponential (or Laplace) distribution with parameter λ > 0

PX Marginal distribution of X

Function spaces

W (β, L) Sobolev class of functions

Θ(β,Q) Sobolev ellipsoid of ℓ2(IN)

Page 6: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

Contents

Preface i

Notation iii

Contents v

Introduction 1

1 Sub-Gaussian Random Variables 14
  1.1 Gaussian tails and MGF 14
  1.2 Sub-Gaussian random variables and Chernoff bounds 16
  1.3 Sub-exponential random variables 22
  1.4 Maximal inequalities 25
  1.5 Sums of independent random matrices 30
  1.6 Problem set 36

2 Linear Regression Model 40
  2.1 Fixed design linear regression 40
  2.2 Least squares estimators 42
  2.3 The Gaussian Sequence Model 49
  2.4 High-dimensional linear regression 54
  2.5 Problem set 70

3 Misspecified Linear Models 72
  3.1 Oracle inequalities 73
  3.2 Nonparametric regression 81
  3.3 Problem Set 91

4 Minimax Lower Bounds 92
  4.1 Optimality in a minimax sense 92
  4.2 Reduction to finite hypothesis testing 94
  4.3 Lower bounds based on two hypotheses 95
  4.4 Lower bounds based on many hypotheses 101
  4.5 Application to the Gaussian sequence model 104
  4.6 Lower bounds for sparse estimation via χ² divergence 109
  4.7 Lower bounds for estimating the ℓ1 norm via moment matching 112
  4.8 Problem Set 120

5 Matrix estimation 121
  5.1 Basic facts about matrices 121
  5.2 Multivariate regression 124
  5.3 Covariance matrix estimation 132
  5.4 Principal component analysis 134
  5.5 Graphical models 141
  5.6 Problem set 154

Bibliography 157


Introduction

This course is mainly about learning a regression function from a collection of observations. In this chapter, after defining this task formally, we give an overview of the course and the questions around regression. We adopt the statistical learning point of view, where the task of prediction prevails. Nevertheless, many interesting questions will remain unanswered when the last page comes: testing, model selection, implementation, . . .

REGRESSION ANALYSIS AND PREDICTION RISK

Model and definitions

Let (X, Y) ∈ X × Y, where X is called the feature and lives in a topological space X, and Y ∈ Y ⊂ IR is called the response, or sometimes the label when Y is a discrete set, e.g., Y = {0, 1}. Often X ⊂ IRd, in which case X is called the vector of covariates or simply covariate. Our goal will be to predict Y given X and, for our problem to be meaningful, we need Y to depend non-trivially on X. Our task would be done if we had access to the conditional distribution of Y given X. This is the world of the probabilist. The statistician does not have access to this valuable information but rather has to estimate it, at least partially. The regression function gives a simple summary of this conditional distribution, namely, the conditional expectation.

Formally, the regression function of Y onto X is defined by:

f(x) = IE[Y |X = x] , x ∈ X .

As we will see, it arises naturally in the context of prediction.

Best prediction and prediction risk

Suppose for a moment that you know the conditional distribution of Y given X. Given the realization of X = x, your goal is to predict the realization of Y. Intuitively, we would like to find a measurable¹ function g : X → Y such that g(X) is close to Y, in other words, such that |Y − g(X)| is small. But |Y − g(X)| is a random variable, so it is not clear what "small" means in this context. A somewhat arbitrary answer can be given by declaring a random variable Z small if IE[Z²] = [IE Z]² + var[Z] is small.

¹ All topological spaces are equipped with their Borel σ-algebra.


Indeed, in this case, the expectation of Z is small and the fluctuations of Z around this value are also small. The function R(g) = IE[Y − g(X)]² is called the L2 risk of g, defined for IE Y² < ∞.

For any measurable function g : X → IR, the L2 risk of g can be decomposed as

IE[Y − g(X)]² = IE[Y − f(X) + f(X) − g(X)]²
             = IE[Y − f(X)]² + IE[f(X) − g(X)]² + 2 IE[Y − f(X)][f(X) − g(X)] .

The cross-product term satisfies

IE[Y − f(X)][f(X) − g(X)] = IE[ IE( [Y − f(X)][f(X) − g(X)] | X ) ]
                          = IE[ (IE(Y | X) − f(X)) (f(X) − g(X)) ]
                          = IE[ (f(X) − f(X)) (f(X) − g(X)) ]
                          = 0 .

The above two equations yield

IE[Y − g(X)]2 = IE[Y − f(X)]2 + IE[f(X)− g(X)]2 ≥ IE[Y − f(X)]2 ,

with equality iff f(X) = g(X) almost surely. We have proved that the regression function f(x) = IE[Y |X = x], x ∈ X, enjoys the best prediction property, that is,

IE[Y − f(X)]² = inf_g IE[Y − g(X)]² ,

where the infimum is taken over all measurable functions g : X → IR.
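To illustrate the best prediction property numerically, here is a minimal simulation sketch in Python (not from the notes; the data-generating model and the competitor g are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Y = f(X) + noise with f(x) = 2x, so f is the regression function IE[Y | X = x].
X = rng.uniform(0.0, 1.0, size=n)
Y = 2.0 * X + rng.normal(0.0, 0.5, size=n)

f = lambda x: 2.0 * x          # the regression function
g = lambda x: 1.5 * x + 0.1    # an arbitrary competitor

risk_f = np.mean((Y - f(X)) ** 2)   # approximately the noise variance 0.25
risk_g = np.mean((Y - g(X)) ** 2)   # approximately 0.25 + IE[f(X) - g(X)]^2

print(risk_f, risk_g, risk_g - risk_f)   # risk_g exceeds risk_f, as predicted
```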

Prediction and estimation

As we said before, in a statistical problem, we do not have access to the conditional distribution of Y given X or even to the regression function f of Y onto X. Instead, we observe a sample Dn = {(X1, Y1), . . . , (Xn, Yn)} that consists of independent copies of (X, Y). The goal of regression function estimation is to use this data to construct an estimator fn : X → Y that has small L2 risk R(fn).

Let PX denote the marginal distribution of X and, for any h : X → IR, define

‖h‖₂² = ∫_X h² dPX .

Note that ‖h‖₂² is the Hilbert norm associated to the inner product

〈h, h′〉₂ = ∫_X h h′ dPX .

When the reference measure is clear from the context, we will simply write ‖h‖₂ = ‖h‖_{L²(PX)} and 〈h, h′〉₂ := 〈h, h′〉_{L²(PX)}.

It follows from the proof of the best prediction property above that

R(fn) = IE[Y − f(X)]² + ‖fn − f‖₂² = inf_g IE[Y − g(X)]² + ‖fn − f‖₂² .

In particular, the prediction risk will always be at least equal to the positive constant IE[Y − f(X)]². Since we tend to prefer a measure of accuracy that can go to zero (as the sample size increases), it is equivalent to study the estimation error ‖fn − f‖₂². Note that if fn is random, then ‖fn − f‖₂² and R(fn) are random quantities and we need deterministic summaries to quantify their size. It is customary to use one of the two following options. Let {φn}n be a sequence of positive numbers that tends to zero as n goes to infinity.

1. Bounds in expectation. They are of the form:

IE‖fn − f‖22 ≤ φn ,

where the expectation is taken with respect to the sample Dn. They indicate the average behavior of the estimator over multiple realizations of the sample. Such bounds have been established in nonparametric statistics, where typically φn = O(n^{−α}) for some α ∈ (1/2, 1), for example.

Note that such bounds do not characterize the size of the deviation of the random variable ‖fn − f‖₂² around its expectation. It may therefore be appropriate to accompany such a bound with the second option below.

2. Bounds with high-probability. They are of the form:

IP[ ‖fn − f‖₂² > φn(δ) ] ≤ δ , ∀ δ ∈ (0, 1/3) .

Here 1/3 is arbitrary and can be replaced by another positive constant. Such bounds control the tail of the distribution of ‖fn − f‖₂²: they show how large the quantiles of the random variable ‖fn − f‖₂² can be. Such bounds are favored in learning theory, and are sometimes called PAC-bounds (for Probably Approximately Correct).

Often, bounds with high probability follow from a bound in expectation and a concentration inequality that bounds the probability

IP[ ‖fn − f‖₂² − IE‖fn − f‖₂² > t ]

by a quantity that decays to zero exponentially fast. Concentration of measure is a fascinating but wide topic and we will only briefly touch on it. We recommend the reading of [BLM13] to the interested reader. This book presents many aspects of concentration that are particularly well suited to the applications covered in these notes.


Other measures of error

We have chosen the L2 risk somewhat arbitrarily. Why not the Lp risk defined by g ↦ IE|Y − g(X)|^p for some p ≥ 1? The main reason for choosing the L2 risk is that it greatly simplifies the mathematics of our problem: it is a Hilbert space! In particular, for any estimator fn, we have the remarkable identity

R(fn) = IE[Y − f(X)]² + ‖fn − f‖₂² .

This equality allowed us to consider only the part ‖fn − f‖₂² as a measure of error. While this decomposition may not hold for other risk measures, it may be desirable to explore other distances (or pseudo-distances). This leads to two distinct ways to measure error: either by bounding a pseudo-distance d(fn, f) (estimation error) or by bounding the risk R(fn) for choices other than the L2 risk. These two measures coincide up to the additive constant IE[Y − f(X)]² in the case described above. However, we show below that these two quantities may live independent lives. Bounding the estimation error is more customary in statistics whereas risk bounds are preferred in learning theory.

Here is a list of choices for the pseudo-distance employed in the estimation error.

• Pointwise error. Given a point x0, the pointwise error measures only the error at this point. It uses the pseudo-distance:

d0(fn, f) = |fn(x0)− f(x0)| .

• Sup-norm error. Also known as the L∞-error and defined by

d∞(fn, f) = sup_{x∈X} |fn(x) − f(x)| .

It controls the worst possible pointwise error.

• Lp-error. It generalizes both the L2 distance and the sup-norm error by taking, for any p ≥ 1, the pseudo-distance

  dp(fn, f) = ∫_X |fn − f|^p dPX .

  The choice of p is somewhat arbitrary and mostly employed as a mathematical exercise.

Note that these three examples can be split into two families: global (sup-norm and Lp) and local (pointwise).

For specific problems, other considerations come into play. For example, if Y ∈ {0, 1} is a label, one may be interested in the classification risk of a classifier h : X → {0, 1}. It is defined by

R(h) = IP(Y ≠ h(X)) .


We will not cover this problem in this course.

Finally, we will devote a large part of these notes to the study of linear models. For such models, X = IRd and f is linear (or affine), i.e., f(x) = x>θ for some unknown θ ∈ IRd. In this case, it is traditional to measure error directly on the coefficient θ. For example, if fn(x) = x>θn is a candidate linear estimator, it is customary to measure the distance of fn to f using a (pseudo-)distance between θn and θ, as long as θ is identifiable.

MODELS AND METHODS

Empirical risk minimization

In our considerations on measuring the performance of an estimator fn, we have carefully avoided the question of how to construct fn. This is of course one of the most important tasks of statistics. As we will see, it can be carried out in a fairly mechanical way by following one simple principle: Empirical Risk Minimization (ERM²). Indeed, an overwhelming proportion of statistical methods consist in replacing an (unknown) expected value (IE) by a (known) empirical mean ((1/n) Σ_{i=1}^n). For example, it is well known that a good candidate to estimate the expected value IE X of a random variable X from a sequence of i.i.d. copies X1, . . . , Xn of X is their empirical average

X̄ = (1/n) Σ_{i=1}^n Xi .

In many instances, it corresponds to the maximum likelihood estimator of IE X. Another example is the sample variance, where IE(X − IE(X))² is estimated by

(1/n) Σ_{i=1}^n (Xi − X̄)² .

It turns out that this principle can be extended even if an optimization follows the substitution. Recall that the L2 risk is defined by R(g) = IE[Y − g(X)]². See the expectation? Well, it can be replaced by an average to form the empirical risk of g, defined by

Rn(g) = (1/n) Σ_{i=1}^n (Yi − g(Xi))² .

We can now proceed to minimizing this risk. However, we have to be careful. Indeed, Rn(g) ≥ 0 for all g. Therefore, any function g such that Yi = g(Xi) for all i = 1, . . . , n is a minimizer of the empirical risk. Yet, it may not be the best choice (cf. Figure 1). To overcome this limitation, we need to leverage some prior knowledge on f: either it may belong to a certain class G of functions (e.g., linear functions) or it is smooth (e.g., the L2-norm of its second derivative is small).

² ERM may also mean Empirical Risk Minimizer.


Figure 1. It may not be the best idea to have fn(Xi) = Yi for all i = 1, . . . , n.

In both cases, this extra knowledge can be incorporated into ERM using either a constraint:

min_{g∈G} Rn(g) ,

or a penalty:

min_g { Rn(g) + pen(g) } ,

or both:

min_{g∈G} { Rn(g) + pen(g) } .

These schemes belong to the general idea of regularization. We will see many variants of regularization throughout the course.
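To make the recipe concrete, here is a minimal Python sketch (not from the notes) of penalized ERM over linear functions with a ridge-type penalty pen(g) = λ|θ|₂², using scipy's general-purpose optimizer; the data-generating model and the choice λ = 0.1 are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 50, 10
X = rng.standard_normal((n, d))
theta_star = np.zeros(d); theta_star[:3] = 1.0   # illustrative true coefficients
Y = X @ theta_star + rng.normal(0.0, 0.5, size=n)

lam = 0.1  # penalty level (illustrative)

def empirical_risk(theta):
    # Rn(g) for g(x) = x^T theta
    return np.mean((Y - X @ theta) ** 2)

def penalized_risk(theta):
    # Rn(g) + pen(g), with pen(g) = lam * |theta|_2^2
    return empirical_risk(theta) + lam * np.sum(theta ** 2)

theta_hat = minimize(penalized_risk, x0=np.zeros(d)).x
print(np.round(theta_hat, 2))
```

In practice the Lasso penalty λ|θ|₁ mentioned below is solved with dedicated convex-optimization routines rather than a generic optimizer; the sketch above only illustrates the constraint/penalty template.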

Unlike traditional (low dimensional) statistics, computation plays a key role in high-dimensional statistics. Indeed, what is the point of describing an estimator with good prediction properties if it takes years to compute it on large datasets? As a result of this observation, many modern estimators, such as the Lasso estimator for sparse linear regression, can be computed efficiently using simple tools from convex optimization. We will not describe such algorithms for this problem but will comment on the computability of estimators when relevant.

In particular, computational considerations have driven the field of compressed sensing, which is closely connected to the problem of sparse linear regression studied in these notes. We will only briefly mention some of the results and refer the interested reader to the book [FR13] for a comprehensive treatment.


Linear models

When X = IRd, an all-time favorite constraint G is the class of linear functions of the form g(x) = x>θ, that is, functions parametrized by θ ∈ IRd. Under this constraint, the estimator obtained by ERM is usually called the least squares estimator and is defined by fn(x) = x>θ̂, where

θ̂ ∈ argmin_{θ∈IRd} (1/n) Σ_{i=1}^n (Yi − Xi>θ)² .

Note that θ̂ may not be unique. In the case of a linear model, where we assume that the regression function is of the form f(x) = x>θ∗ for some unknown θ∗ ∈ IRd, we will need assumptions to ensure identifiability if we want to prove bounds on d(θ̂, θ∗) for some specific pseudo-distance d(·, ·). Nevertheless, in other instances such as regression with fixed design, we can prove bounds on the prediction error that are valid for any θ̂ in the argmin. In the latter case, we will not even require that f satisfies the linear model, but our bound will be meaningful only if f can be well approximated by a linear function. In this case, we talk about a misspecified model, i.e., we try to fit a linear model to data that may not come from a linear model. Since linear models can have good approximation properties, especially when the dimension d is large, our hope is that the linear model is never too far from the truth.
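A minimal numerical sketch of the least squares estimator (not from the notes; numpy assumed). When d > n the argmin is not unique, and np.linalg.lstsq returns the minimum-norm solution; the data-generating model below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 50                        # high-dimensional setting: d > n
X = rng.standard_normal((n, d))
theta_star = np.zeros(d); theta_star[:5] = 1.0   # illustrative true coefficients
Y = X @ theta_star + rng.normal(0.0, 0.1, size=n)

# One element of the argmin of the empirical risk (minimum-norm least squares).
theta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Any minimizer (nearly) interpolates the data when d > n ...
print("empirical risk:", np.mean((Y - X @ theta_hat) ** 2))
# ... yet it can be far from theta_star, which is why d > n needs extra structure.
print("|theta_hat - theta_star|_2:", np.linalg.norm(theta_hat - theta_star))
```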

In the case of a misspecified model, there is no hope of driving the estimation error d(fn, f) down to zero, even with a sample size that tends to infinity. Rather, we will pay a systematic approximation error. When G is a linear subspace as above, and the pseudo-distance is given by the squared L2 norm d(fn, f) = ‖fn − f‖₂², it follows from the Pythagorean theorem that

‖fn − f‖₂² = ‖fn − f̄‖₂² + ‖f̄ − f‖₂² ,

where f̄ is the projection of f onto the linear subspace G. The systematic approximation error is entirely contained in the deterministic term ‖f̄ − f‖₂², and one can proceed to bound ‖fn − f̄‖₂² by a quantity that goes to zero as n goes to infinity. In this case, bounds (e.g., in expectation) on the estimation error take the form

IE‖fn − f‖₂² ≤ ‖f̄ − f‖₂² + φn .

The above inequality is called an oracle inequality. Indeed, it says that if φn is small enough, then the estimator fn mimics the oracle f̄. It is called an "oracle" because it cannot be constructed without the knowledge of the unknown f. It is clearly the best we can do when we restrict our attention to estimators in the class G. Going back to the gap in knowledge between a probabilist, who knows the whole joint distribution of (X, Y), and a statistician, who only sees the data, the oracle sits somewhere in between: it can only see the whole distribution through the lens provided by the statistician. In the case above, the lens is that of linear regression functions. Different oracles are more or less powerful and there is a tradeoff to be achieved. On the one hand, if the oracle is weak, then it is easy for the statistician to mimic it, but it may be very far from the true regression function; on the other hand, if the oracle is strong, then it is harder to mimic, but it is much closer to the truth.

Oracle inequalities were originally developed as analytic tools to prove adaptation of some nonparametric estimators. With the development of aggregation [Nem00, Tsy03, Rig06] and high dimensional statistics [CT07, BRT09, RT11], they have become important finite sample results that characterize the interplay between the important parameters of the problem.

In some favorable instances, that is, when the Xi's enjoy specific properties, it is even possible to estimate the vector θ accurately, as is done in parametric statistics. The techniques employed for this goal will essentially be the same as the ones employed to minimize the prediction risk. The extra assumptions on the Xi's will then translate into interesting properties of θ̂ itself, including uniqueness, on top of the prediction properties of the function fn(x) = x>θ̂.

High dimension and sparsity

These lecture notes are about high dimensional statistics and it is time they enter the picture. By high dimension, we informally mean that the model has more "parameters" than there are observations. The word "parameter" is used here loosely and a more accurate description is perhaps degrees of freedom. For example, the linear model f(x) = x>θ∗ has one parameter θ∗ but effectively d degrees of freedom when θ∗ ∈ IRd. The notion of degrees of freedom is actually well defined in the statistical literature but the formal definition does not help our informal discussion here.

As we will see in Chapter 2, if the regression function is linear, f(x) = x>θ∗, θ∗ ∈ IRd, and under some assumptions on the marginal distribution of X, then the least squares estimator fn(x) = x>θ̂n satisfies

IE‖fn − f‖₂² ≤ C d/n , (1)

where C > 0 is a constant, and in Chapter 4 we will show that this cannot be improved, except perhaps for a smaller multiplicative constant. Clearly such a bound is uninformative if d ≫ n and actually, in view of its optimality, we can even conclude that the problem is too difficult statistically. However, the situation is not hopeless if we assume that the problem actually has fewer degrees of freedom than it seems. In particular, it is now standard to resort to the sparsity assumption to overcome this limitation.

A vector θ ∈ IRd is said to be k-sparse for some k ∈ {0, . . . , d} if it has at most k nonzero coordinates. We denote by |θ|0 the number of nonzero coordinates of θ, which is also known as the sparsity or "ℓ0-norm" of θ, though it is clearly not a norm (see footnote 3). Formally, it is defined as

|θ|0 = Σ_{j=1}^d 1I(θj ≠ 0) .


Sparsity is just one of many ways to limit the size of the set of potential θ vectors to consider. One could, for example, consider vectors θ that have the following structure (see Figure 2):

• Monotonic: θ1 ≥ θ2 ≥ · · · ≥ θd

• Smooth: |θi − θj | ≤ C|i− j|α for some α > 0

• Piecewise constant: Σ_{j=1}^{d−1} 1I(θj+1 ≠ θj) ≤ k

• Structured in another basis: θ = Ψµ, for some orthogonal matrix Ψ and µ in one of the structured classes described above.

Sparsity plays a significant role in statistics because, often, structure translates into sparsity in a certain basis. For example, a smooth function is sparse in the trigonometric basis and a piecewise constant function has sparse increments. Moreover, real images are approximately sparse in certain bases such as wavelet or Fourier bases. This is precisely the feature exploited in compression schemes such as JPEG or JPEG-2000: only a few coefficients in these images are necessary to retain the main features of the image.

We say that θ is approximately sparse if |θ|0 may be as large as d but many coefficients |θj| are small, rather than exactly equal to zero. There are several mathematical ways to capture this phenomenon, including ℓq-"balls" for q ≤ 1. For q > 0, the unit ℓq-ball of IRd is defined as

Bq(1) = { θ ∈ IRd : |θ|q^q = Σ_{j=1}^d |θj|^q ≤ 1 } ,

where |θ|q is often called the ℓq-norm³. As we will see, the smaller q is, the better vectors in the unit ℓq ball can be approximated by sparse vectors.

Note that the set of k-sparse vectors of IRd is a union of Σ_{j=0}^k (d choose j) linear subspaces of dimension at most k, each spanned by at most k vectors of the canonical basis of IRd. If we knew that θ∗ belongs to one of these subspaces, we could simply drop the irrelevant coordinates and obtain an oracle inequality such as (1), with d replaced by k. Since we do not know in which subspace θ∗ lives exactly, we will have to pay an extra term to find it. It turns out that this term is exactly of the order of

log( Σ_{j=0}^k (d choose j) ) / n  ≃  C k log(ed/k) / n .

Therefore, the price to pay for not knowing which subspace to look at is only a logarithmic factor.
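For completeness, here is the standard argument behind the display above (a sketch, not reproduced in the notes): for 1 ≤ k ≤ d,

```latex
% Standard bound: \sum_{j=0}^k \binom{d}{j} \le (ed/k)^k, hence
% \log \sum_{j=0}^k \binom{d}{j} \le k \log(ed/k).
\[
\Big(\frac{k}{d}\Big)^k \sum_{j=0}^k \binom{d}{j}
\;\le\; \sum_{j=0}^k \binom{d}{j} \Big(\frac{k}{d}\Big)^j
\;\le\; \Big(1+\frac{k}{d}\Big)^d
\;\le\; e^k ,
\]
% where the first inequality uses k/d \le 1 and j \le k, and the last uses 1+x \le e^x.
% Rearranging gives \sum_{j=0}^k \binom{d}{j} \le (d/k)^k e^k = (ed/k)^k.
```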

³ Strictly speaking, |θ|q is a norm and the ℓq ball is a ball only for q ≥ 1.

Figure 2. Examples of structured vectors θ ∈ IR50 (panels: Monotone, Smooth, Constant, Basis).

Nonparametric regression

Nonparametric does not mean that there is no parameter to estimate (the regression function is a parameter) but rather that the parameter to estimate is infinite dimensional (this is the case of a function). In some instances, this parameter can be identified with an infinite sequence of real numbers, so that we are still in the realm of countable infinity. Indeed, observe that since L2(PX) equipped with the inner product 〈·, ·〉₂ is a separable Hilbert space, it admits an orthonormal basis {ϕk}k∈Z and any function f ∈ L2(PX) can be decomposed as

f = Σ_{k∈Z} αk ϕk ,

where αk = 〈f, ϕk〉₂.

Therefore, estimating a regression function f amounts to estimating the infinite sequence {αk}k∈Z ∈ ℓ2. You may argue (correctly) that the basis {ϕk}k∈Z is also unknown, as it depends on the unknown PX. This is absolutely correct, but we will make the convenient assumption that PX is (essentially) known whenever this is needed.

Even if infinity is countable, we still have to estimate an infinite number of coefficients using a finite number of observations. It does not require much statistical intuition to realize that this task is impossible in general. What if we know something about the sequence {αk}k? For example, if we know that αk = 0 for |k| > k0, then there are only 2k0 + 1 parameters to estimate (in general, one would also have to "estimate" k0). In practice, we will not exactly see αk = 0 for |k| > k0, but rather that the sequence {αk}k decays to 0 at a certain polynomial rate, for example |αk| ≤ C|k|^{−γ} for some γ > 1/2 (we need this sequence to be in ℓ2). This corresponds to a smoothness assumption on the function f. In this case, the sequence {αk}k can be well approximated by a sequence with only a finite number of non-zero terms.

We can view this problem as a misspecified model. Indeed, for any cut-off k0, define the oracle

f̄_{k0} = Σ_{|k|≤k0} αk ϕk .

Note that it depends on the unknown αk. Define also the estimator

fn = Σ_{|k|≤k0} α̂k ϕk ,

where the α̂k are some data-driven coefficients (obtained by least squares, for example). Then, by the Pythagorean theorem and Parseval's identity, we have

‖fn − f‖₂² = ‖f̄_{k0} − f‖₂² + ‖fn − f̄_{k0}‖₂²
           = Σ_{|k|>k0} αk² + Σ_{|k|≤k0} (α̂k − αk)² .

We can work further on this oracle inequality using the fact that |αk| ≤ C|k|^{−γ}. Indeed, we have⁴

Σ_{|k|>k0} αk² ≤ C² Σ_{|k|>k0} |k|^{−2γ} ≤ C k0^{1−2γ} .

The so-called stochastic term IE Σ_{|k|≤k0} (α̂k − αk)² clearly increases with k0 (more parameters to estimate) whereas the approximation term C k0^{1−2γ} decreases with k0 (fewer terms discarded). We will see that we can strike a compromise called the bias-variance tradeoff.
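As a rough sketch of that compromise (not carried out in this introduction, and under the additional assumption, standard for least-squares coefficients, that the stochastic term is of order k0/n):

```latex
% Balance the approximation term k_0^{1-2\gamma} against a stochastic term of
% order k_0/n (an assumption on the estimator, not derived here):
\[
\frac{k_0}{n} \asymp k_0^{1-2\gamma}
\quad\Longleftrightarrow\quad
k_0 \asymp n^{\frac{1}{2\gamma}},
\qquad\text{giving}\qquad
\|\hat f_n - f\|_2^2 \lesssim n^{-\frac{2\gamma-1}{2\gamma}} .
\]
```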

⁴ Here we illustrate a convenient notational convention that we will be using throughout these notes: a constant C may be different from line to line. This will not affect the interpretation of our results since we are interested in the order of magnitude of the error bounds. Nevertheless we will, as much as possible, try to make such constants explicit. As an exercise, try to find an expression of the second C as a function of the first one and of γ.

The main difference between this and oracle inequalities is that we make assumptions on the regression function (here in terms of smoothness) in order to control the approximation error. Therefore, oracle inequalities are more general, but can be seen on the one hand as less quantitative. On the other hand, if one is willing to accept the fact that approximation error is inevitable, then there is no reason to focus on it. This is not the final answer to this rather philosophical question. Indeed, choosing the right k0 can only be done with a control of the approximation error: the best k0 will depend on γ. We will see that even if the smoothness index γ is unknown, we can select k0 in a data-driven way that achieves almost the same performance as if γ were known. This phenomenon is called adaptation (to γ).

It is important to notice the main difference between the approach taken in nonparametric regression and the one in sparse linear regression. It is not so much about linear vs. non-linear models, as we can always first take nonlinear transformations of the xj's in linear regression. Instead, sparsity or approximate sparsity is a much weaker notion than the decay of coefficients {αk}k presented above. In a way, sparsity only imposes that, after ordering, the coefficients present a certain decay, whereas in nonparametric statistics the order is set ahead of time: we assume that we have found a basis that is ordered in such a way that the coefficients decay at a certain rate.

Matrix models

In the previous examples, the response variable is always assumed to be a scalar. What if it is a higher dimensional signal? In Chapter 5, we consider various problems of this form: matrix completion, a.k.a. the Netflix problem, structured graph estimation and covariance matrix estimation. All these problems can be described as follows.

Let M, S and N be three matrices, respectively called observation, signal and noise, that satisfy

M = S + N .

Here N is a random matrix such that IE[N] = 0, the all-zero matrix. The goal is to estimate the signal matrix S from the observation of M.

The structure of S can also be chosen in various ways. We will consider the case where S is sparse in the sense that it has many zero coefficients. In a way, this assumption does not leverage much of the matrix structure and essentially treats matrices as vectors arranged in the form of an array. This is not the case of low rank structures, where one assumes that the matrix S has either low rank or can be well approximated by a low rank matrix. This assumption makes sense in the case where S represents user preferences, as in the Netflix example. In this example, the (i, j)th coefficient Sij of S corresponds to the rating (on a scale from 1 to 5) that user i gave to movie j. The low rank assumption simply materializes the idea that there are a few canonical profiles of users and that each user can be represented as a linear combination of these profiles.

At first glance, this problem seems much more difficult than sparse linear regression. Indeed, one needs to learn not only the sparse coefficients in a given basis, but also the basis of eigenvectors. Fortunately, it turns out that the latter task is much easier and is dominated by the former in terms of statistical price.

Another important example of matrix estimation is high-dimensional covariance estimation, where the goal is to estimate the covariance matrix of a random vector X ∈ IRd, or its leading eigenvectors, based on n observations. Such a problem has many applications, including principal component analysis, linear discriminant analysis and portfolio optimization. The main difficulty is that n may be much smaller than the number of degrees of freedom in the covariance matrix, which can be of order d². To overcome this limitation, assumptions on the rank or the sparsity of the matrix can be leveraged.

Optimality and minimax lower bounds

So far, we have only talked about upper bounds. For a linear model, where f(x) = x>θ∗, we will prove in Chapter 2 the following bound for a modified least squares estimator fn(x) = x>θ̂:

IE‖fn − f‖₂² ≤ C d/n .

Is this the right dependence on d and n? Would it be possible to obtain as an upper bound something like C(log d)/n, C/n, or √d/n², by either improving our proof technique or using another estimator altogether? It turns out that the answer to this question is negative. More precisely, we can prove that for any estimator fn, there exists a function f of the form f(x) = x>θ∗ such that

IE‖fn − f‖₂² > c d/n

for some positive constant c. Here we used a different notation for the constant to emphasize the fact that lower bounds guarantee optimality only up to a constant factor. Such a lower bound on the risk is called a minimax lower bound, for reasons that will become clearer in Chapter 4.

How is this possible? How can we make a statement for all estimators? We will see that these statements borrow from the theory of tests, where we know that it is impossible to drive both the type I and the type II error to zero simultaneously (with a fixed sample size). Intuitively, this phenomenon is related to the following observation: given n observations X1, . . . , Xn, it is hard to tell if they are distributed according to N(θ, 1) or to N(θ′, 1) when the Euclidean distance |θ − θ′|₂ is small enough. We will see that this is the case, for example, if |θ − θ′|₂ ≤ C√(d/n), which will yield our lower bound.

Chapter 1

Sub-Gaussian Random Variables

1.1 GAUSSIAN TAILS AND MGF

Recall that a random variable X ∈ IR has a Gaussian distribution iff it has a density p with respect to the Lebesgue measure on IR given by

p(x) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ) , x ∈ IR ,

where µ = IE(X) ∈ IR and σ² = var(X) > 0 are the mean and variance of X. We write X ∼ N(µ, σ²). Note that X = σZ + µ for Z ∼ N(0, 1) (called standard Gaussian), where the equality holds in distribution. Clearly, this distribution has unbounded support but it is well known that it has almost bounded support in the following sense: IP(|X − µ| ≤ 3σ) ≃ 0.997. This is due to the fast decay of the tails of p as |x| → ∞ (see Figure 1.1). This decay can be quantified using the following proposition (Mill's inequality).

Proposition 1.1. Let X be a Gaussian random variable with mean µ and variance σ². Then, for any t > 0, it holds

IP(X − µ > t) ≤ (σ/(t√(2π))) e^{−t²/(2σ²)} .

By symmetry, we also have

IP(X − µ < −t) ≤ (σ/(t√(2π))) e^{−t²/(2σ²)}


Figure 1.1. Probabilities of falling within 1, 2, and 3 standard deviations of the mean in a Gaussian distribution (68%, 95%, and 99.7%). Source: http://www.openintro.org/

and

IP(|X − µ| > t) ≤ (σ/t) √(2/π) e^{−t²/(2σ²)} .

Proof. Note that it is sufficient to prove the theorem for µ = 0 and σ² = 1 by simple translation and rescaling. We get, for Z ∼ N(0, 1),

IP(Z > t) = (1/√(2π)) ∫_t^∞ exp(−x²/2) dx
          ≤ (1/√(2π)) ∫_t^∞ (x/t) exp(−x²/2) dx
          = (1/(t√(2π))) ∫_t^∞ −(∂/∂x) exp(−x²/2) dx
          = (1/(t√(2π))) exp(−t²/2) .

The second inequality follows from symmetry and the last one using the union bound:

IP(|Z| > t) = IP({Z > t} ∪ {Z < −t}) ≤ IP(Z > t) + IP(Z < −t) = 2 IP(Z > t) .
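As a quick numerical sanity check (not part of the notes; scipy assumed, with illustrative values of σ and t), one can compare the exact Gaussian tail with the bound of Proposition 1.1:

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0
for t in [1.0, 2.0, 5.0]:
    exact = norm.sf(t, loc=0.0, scale=sigma)                       # IP(X - mu > t)
    bound = sigma / (t * np.sqrt(2 * np.pi)) * np.exp(-t**2 / (2 * sigma**2))
    print(t, exact, bound)   # the bound dominates and tightens as t/sigma grows
```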

The fact that a Gaussian random variable Z has tails that decay to zero exponentially fast can also be seen in the moment generating function (MGF)

M : s ↦ M(s) = IE[exp(sZ)] .


Indeed, in the case of a standard Gaussian random variable, we have

M(s) = IE[exp(sZ)] = (1/√(2π)) ∫ e^{sz} e^{−z²/2} dz
                   = (1/√(2π)) ∫ e^{−(z−s)²/2 + s²/2} dz
                   = e^{s²/2} .

It follows that if X ∼ N(µ, σ²), then IE[exp(sX)] = exp(sµ + σ²s²/2).

1.2 SUB-GAUSSIAN RANDOM VARIABLES AND CHERNOFF BOUNDS

Definition and first properties

Gaussian tails are practical when controlling the tail of an average of independent random variables. Indeed, recall that if X1, . . . , Xn are i.i.d. N(µ, σ²), then X̄ = (1/n) Σ_{i=1}^n Xi ∼ N(µ, σ²/n). Using Lemma 1.3 below for example, we get

IP(|X̄ − µ| > t) ≤ 2 exp( −nt²/(2σ²) ) .

Equating the right-hand side with some confidence level δ > 0, we find that with probability at least¹ 1 − δ,

µ ∈ [ X̄ − σ√(2 log(2/δ)/n) , X̄ + σ√(2 log(2/δ)/n) ] , (1.1)

This is almost the confidence interval that you used in introductory statistics. The only difference is that we used an approximation for the Gaussian tail whereas statistical tables or software use a much more accurate computation. Figure 1.2 shows the ratio of the width of the confidence interval to that of the confidence interval computed by the software R. It turns out that intervals of the same form can also be derived for non-Gaussian random variables, as long as they have sub-Gaussian tails.
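A small Python sketch (not from the notes; scipy assumed, with illustrative values of σ, n, δ) comparing the width of interval (1.1) with the exact Gaussian confidence interval, in the spirit of Figure 1.2:

```python
import numpy as np
from scipy.stats import norm

sigma, n = 1.0, 100
for delta in [0.10, 0.05, 0.01]:
    width_subgauss = 2 * sigma * np.sqrt(2 * np.log(2 / delta) / n)   # interval (1.1)
    width_exact = 2 * sigma * norm.ppf(1 - delta / 2) / np.sqrt(n)    # exact Gaussian quantile
    print(delta, width_subgauss, width_exact, width_subgauss / width_exact)
```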

Definition 1.2. A random variable X ∈ IR is said to be sub-Gaussian with variance proxy σ² if IE[X] = 0 and its moment generating function satisfies

IE[exp(sX)] ≤ exp(σ²s²/2) , ∀ s ∈ IR . (1.2)

In this case we write X ∼ subG(σ²). Note that subG(σ²) denotes a class of distributions rather than a distribution. Therefore, we abuse notation when writing X ∼ subG(σ²).

More generally, we can talk about sub-Gaussian random vectors and matrices.

¹ We will often omit the statement "at least" for brevity.

Figure 1.2. Width of confidence intervals from exact computation in R (red dashed) and from the approximation (1.1) (solid black). Here x = δ and y is the width of the confidence intervals.

A random vector X ∈ IRd is said to be sub-Gaussian with variance proxy σ² if IE[X] = 0 and u>X is sub-Gaussian with variance proxy σ² for any vector u ∈ Sd−1. In this case we write X ∼ subGd(σ²). Note that if X ∼ subGd(σ²), then for any v such that |v|₂ ≤ 1, we have v>X ∼ subG(σ²). Indeed, denoting u = v/|v|₂ ∈ Sd−1, we have

IE[e^{s v>X}] = IE[e^{s|v|₂ u>X}] ≤ e^{σ²s²|v|₂²/2} ≤ e^{σ²s²/2} .

A random matrix X ∈ IRd×T is said to be sub-Gaussian with variance proxy σ² if IE[X] = 0 and u>Xv is sub-Gaussian with variance proxy σ² for any unit vectors u ∈ Sd−1, v ∈ ST−1. In this case we write X ∼ subGd×T(σ²).

This property can equivalently be expressed in terms of bounds on the tail of the random variable X.

Lemma 1.3. Let X ∼ subG(σ²). Then for any t > 0, it holds

IP[X > t] ≤ exp( −t²/(2σ²) ) , and IP[X < −t] ≤ exp( −t²/(2σ²) ) . (1.3)

Proof. Assume first that X ∼ subG(σ²). We will employ a very useful technique called the Chernoff bound that allows one to translate a bound on the moment generating function into a tail bound. Using Markov's inequality, we have for any s > 0,

IP(X > t) ≤ IP(e^{sX} > e^{st}) ≤ IE[e^{sX}] / e^{st} .


Next we use the fact that X is sub-Gaussian to get

IP(X > t) ≤ e^{σ²s²/2 − st} .

The above inequality holds for any s > 0, so to make it the tightest possible, we minimize with respect to s > 0. Solving φ′(s) = 0, where φ(s) = σ²s²/2 − st, we find that inf_{s>0} φ(s) = −t²/(2σ²). This proves the first part of (1.3). The second inequality in this equation follows in the same manner (recall that (1.2) holds for any s ∈ IR).

Moments

Recall that the absolute moments of Z ∼ N(0, σ²) are given by

IE[|Z|^k] = (1/√π) (2σ²)^{k/2} Γ((k + 1)/2) ,

where Γ(·) denotes the Gamma function defined by

Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx , t > 0 .

The next lemma shows that the tail bounds of Lemma 1.3 are sufficient to show that the absolute moments of X ∼ subG(σ²) can be bounded by those of Z ∼ N(0, σ²) up to multiplicative constants.

Lemma 1.4. Let X be a random variable such that

IP[|X| > t] ≤ 2 exp( −t²/(2σ²) ) ,

then for any positive integer k ≥ 1,

IE[|X|^k] ≤ (2σ²)^{k/2} k Γ(k/2) .

In particular,

(IE[|X|^k])^{1/k} ≤ σ e^{1/e} √k , k ≥ 2 ,

and IE[|X|] ≤ σ√(2π) .

Proof.

IE[|X|^k] = ∫_0^∞ IP(|X|^k > t) dt
          = ∫_0^∞ IP(|X| > t^{1/k}) dt
          ≤ 2 ∫_0^∞ e^{−t^{2/k}/(2σ²)} dt
          = (2σ²)^{k/2} k ∫_0^∞ e^{−u} u^{k/2−1} du ,   u = t^{2/k}/(2σ²)
          = (2σ²)^{k/2} k Γ(k/2) .


The second statement follows from Γ(k/2) ≤ (k/2)^{k/2} and k^{1/k} ≤ e^{1/e} for any k ≥ 2. It yields

( (2σ²)^{k/2} k Γ(k/2) )^{1/k} ≤ k^{1/k} √(2σ²k/2) ≤ e^{1/e} σ √k .

Moreover, for k = 1, we have √2 Γ(1/2) = √(2π).

Using moments, we can prove the following reciprocal to Lemma 1.3.

Lemma 1.5. If (1.3) holds and IE[X] = 0, then for any s > 0, it holds

IE[exp(sX)] ≤ e^{4σ²s²} .

As a result, we will sometimes write X ∼ subG(σ²) when it satisfies (1.3) and IE[X] = 0.

Proof. We use the Taylor expansion of the exponential function as follows. Observe that, by the dominated convergence theorem,

IE[e^{sX}] ≤ 1 + Σ_{k=2}^∞ s^k IE[|X|^k] / k!
          ≤ 1 + Σ_{k=2}^∞ (2σ²s²)^{k/2} k Γ(k/2) / k!
          = 1 + Σ_{k=1}^∞ (2σ²s²)^k 2k Γ(k) / (2k)!  +  Σ_{k=1}^∞ (2σ²s²)^{k+1/2} (2k+1) Γ(k+1/2) / (2k+1)!
          ≤ 1 + ( 2 + √(2σ²s²) ) Σ_{k=1}^∞ (2σ²s²)^k k! / (2k)!
          ≤ 1 + ( 1 + √(σ²s²/2) ) Σ_{k=1}^∞ (2σ²s²)^k / k!          (using 2(k!)² ≤ (2k)!)
          = e^{2σ²s²} + √(σ²s²/2) ( e^{2σ²s²} − 1 )
          ≤ e^{4σ²s²} .

From the above lemma, we see that sub-Gaussian random variables can be equivalently defined from their tail bounds and their moment generating functions, up to constants.


Sums of independent sub-Gaussian random variables

Recall that if X1, . . . , Xn are i.i.d. N(0, σ²), then for any a ∈ IRn,

Σ_{i=1}^n ai Xi ∼ N(0, |a|₂² σ²) .

If we only care about the tails, this property is preserved for sub-Gaussian random variables.

Theorem 1.6. Let X = (X1, . . . , Xn) be a vector of independent sub-Gaussian random variables that have variance proxy σ². Then, the random vector X is sub-Gaussian with variance proxy σ².

Proof. Let u ∈ Sn−1 be a unit vector, then

IE[e^{s u>X}] = Π_{i=1}^n IE[e^{s ui Xi}] ≤ Π_{i=1}^n e^{σ²s²ui²/2} = e^{σ²s²|u|₂²/2} = e^{σ²s²/2} .

Using a Chernoff bound, we immediately get the following corollary

Corollary 1.7. Let X1, . . . , Xn be n independent random variables such that Xi ∼ subG(σ²). Then for any a ∈ IRn, we have

IP[ Σ_{i=1}^n ai Xi > t ] ≤ exp( −t²/(2σ²|a|₂²) ) ,

and

IP[ Σ_{i=1}^n ai Xi < −t ] ≤ exp( −t²/(2σ²|a|₂²) ) .

Of special interest is the case where ai = 1/n for all i. Then, we get that the average X̄ = (1/n) Σ_{i=1}^n Xi satisfies

IP(X̄ > t) ≤ e^{−nt²/(2σ²)} and IP(X̄ < −t) ≤ e^{−nt²/(2σ²)} ,

just like for the Gaussian average.
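A quick simulation sketch of the last display (not from the notes; numpy assumed, with illustrative n and t) for Rademacher variables, which are subG(1) by Hoeffding's lemma below:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, t = 100, 200_000, 0.2
X = rng.choice([-1.0, 1.0], size=(reps, n))   # Rademacher entries, subG(1)
Xbar = X.mean(axis=1)

empirical = np.mean(Xbar > t)                 # Monte Carlo estimate of IP(Xbar > t)
bound = np.exp(-n * t**2 / 2)                 # sub-Gaussian bound with sigma^2 = 1
print(empirical, bound)                       # empirical tail sits below the bound
```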

Hoeffding’s inequality

The class of sub-Gaussian random variables is actually quite large. Indeed, Hoeffding's lemma below implies that all random variables that are bounded uniformly are actually sub-Gaussian with a variance proxy that depends on the size of their support.


Lemma 1.8 (Hoeffding's lemma (1963)). Let X be a random variable such that IE(X) = 0 and X ∈ [a, b] almost surely. Then, for any s ∈ IR, it holds

IE[e^{sX}] ≤ e^{s²(b−a)²/8} .

In particular, X ∼ subG((b−a)²/4).

Proof. Define ψ(s) = log IE[e^{sX}], and observe that we can readily compute

ψ′(s) = IE[X e^{sX}] / IE[e^{sX}] ,    ψ″(s) = IE[X² e^{sX}] / IE[e^{sX}] − ( IE[X e^{sX}] / IE[e^{sX}] )² .

Thus ψ″(s) can be interpreted as the variance of the random variable X under the probability measure dQ = (e^{sX}/IE[e^{sX}]) dIP. But since X ∈ [a, b] almost surely, we have, under any probability measure,

var(X) = var( X − (a+b)/2 ) ≤ IE[ ( X − (a+b)/2 )² ] ≤ (b−a)²/4 .

The fundamental theorem of calculus yields

ψ(s) = ∫_0^s ∫_0^µ ψ″(ρ) dρ dµ ≤ s²(b−a)²/8 ,

using ψ(0) = log 1 = 0 and ψ′(0) = IE X = 0.

Using a Chernoff bound, we get the following (extremely useful) result.

Theorem 1.9 (Hoeffding's inequality). Let X1, . . . , Xn be n independent random variables such that, almost surely,

Xi ∈ [ai, bi] , ∀ i .

Let X̄ = (1/n) Σ_{i=1}^n Xi. Then for any t > 0,

IP(X̄ − IE(X̄) > t) ≤ exp( −2n²t² / Σ_{i=1}^n (bi − ai)² ) ,

and

IP(X̄ − IE(X̄) < −t) ≤ exp( −2n²t² / Σ_{i=1}^n (bi − ai)² ) .

Note that Hoeffding's lemma holds for any bounded random variable. For example, if one knows that X is a Rademacher random variable, then

IE(e^{sX}) = (e^s + e^{−s})/2 = cosh(s) ≤ e^{s²/2} .

Note that 2 is the best possible constant in the above approximation. For such variables, a = −1, b = 1, and IE(X) = 0, so Hoeffding's lemma yields

IE(e^{sX}) ≤ e^{s²/2} .
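For completeness, the inequality cosh(s) ≤ e^{s²/2} used above follows by comparing Taylor series term by term (a standard one-line argument, not spelled out in the notes):

```latex
\[
\cosh(s) \;=\; \sum_{k=0}^{\infty} \frac{s^{2k}}{(2k)!}
\;\le\; \sum_{k=0}^{\infty} \frac{s^{2k}}{2^k\, k!}
\;=\; e^{s^2/2},
\qquad\text{since } (2k)! \ge 2^k\, k! \text{ for all } k \ge 0 .
\]
```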


Hoeffding's inequality is very general but there is a price to pay for this generality. Indeed, if the random variables have small variance, we would like to see it reflected in the exponential tail bound (as in the Gaussian case), but the variance does not appear in Hoeffding's inequality. We need a more refined inequality.

1.3 SUB-EXPONENTIAL RANDOM VARIABLES

What can we say when a centered random variable is not sub-Gaussian? A typical example is the double exponential (or Laplace) distribution with parameter 1, denoted by Lap(1). Let X ∼ Lap(1) and observe that

IP(|X| > t) = e^{−t} , t ≥ 0 .

In particular, the tails of this distribution do not decay as fast as the Gaussian ones (which decay as e^{−t²/2}). Such tails are said to be heavier than Gaussian. This tail behavior is also captured by the moment generating function of X. Indeed, we have

IE[e^{sX}] = 1/(1 − s²) if |s| < 1 ,

and it is not defined for |s| ≥ 1. It turns out that a rather weak condition on the moment generating function is enough to partially reproduce some of the bounds that we have proved for sub-Gaussian random variables. Observe that for X ∼ Lap(1),

IE[e^{sX}] ≤ e^{2s²} if |s| < 1/2 .

In particular, the moment generating function of the Laplace distribution is bounded by that of a Gaussian in a neighborhood of 0, but does not even exist away from zero. It turns out that all distributions that have tails at most as heavy as those of a Laplace distribution satisfy such a property.
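The bound IE[e^{sX}] ≤ e^{2s²} for |s| < 1/2 can be checked directly (a short computation, not included in the notes):

```latex
\[
\frac{1}{1-s^2} \;=\; 1 + \frac{s^2}{1-s^2} \;\le\; 1 + \tfrac{4}{3}\,s^2
\;\le\; e^{\frac{4}{3}s^2} \;\le\; e^{2s^2},
\qquad |s| \le \tfrac12 ,
\]
% using 1 - s^2 \ge 3/4 for |s| \le 1/2 and 1 + x \le e^x.
```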

Lemma 1.10. Let X be a centered random variable such that IP(|X| > t) ≤ 2e^{−2t/λ} for some λ > 0. Then, for any positive integer k ≥ 1,

IE[|X|^k] ≤ λ^k k! .

Moreover,

(IE[|X|^k])^{1/k} ≤ 2λk ,

and the moment generating function of X satisfies

IE[e^{sX}] ≤ e^{2s²λ²} , ∀ |s| ≤ 1/(2λ) .


Proof.

IE[|X|^k] = ∫_0^∞ IP(|X|^k > t) dt
          = ∫_0^∞ IP(|X| > t^{1/k}) dt
          ≤ ∫_0^∞ 2 e^{−2t^{1/k}/λ} dt
          = 2 (λ/2)^k k ∫_0^∞ e^{−u} u^{k−1} du ,   u = 2t^{1/k}/λ
          ≤ λ^k k Γ(k) = λ^k k! .

The second statement follows from Γ(k) ≤ k^k and k^{1/k} ≤ e^{1/e} ≤ 2 for any k ≥ 1. It yields

( λ^k k Γ(k) )^{1/k} ≤ 2λk .

To control the MGF of X, we use the Taylor expansion of the exponential function as follows. Observe that, by the dominated convergence theorem, for any s such that |s| ≤ 1/(2λ),

IE[e^{sX}] ≤ 1 + Σ_{k=2}^∞ |s|^k IE[|X|^k] / k!
          ≤ 1 + Σ_{k=2}^∞ (|s|λ)^k
          = 1 + s²λ² Σ_{k=0}^∞ (|s|λ)^k
          ≤ 1 + 2s²λ²            (since |s|λ ≤ 1/2)
          ≤ e^{2s²λ²} .

This leads to the following definition

Definition 1.11. A random variable X is said to be sub-exponential with parameter λ (denoted X ∼ subE(λ)) if IE[X] = 0 and its moment generating function satisfies

IE[e^{sX}] ≤ e^{s²λ²/2} , ∀ |s| ≤ 1/λ .

A simple and useful example of a sub-exponential random variable is given in the next lemma.

Lemma 1.12. Let X ∼ subG(σ²). Then the random variable Z = X² − IE[X²] is sub-exponential: Z ∼ subE(16σ²).


Proof. We have, by the dominated convergence theorem,

IE[e^{sZ}] = 1 + Σ_{k=2}^∞ s^k IE[ (X² − IE[X²])^k ] / k!
           ≤ 1 + Σ_{k=2}^∞ s^k 2^{k−1} ( IE[X^{2k}] + (IE[X²])^k ) / k!        (Jensen)
           ≤ 1 + Σ_{k=2}^∞ s^k 4^k IE[X^{2k}] / (2(k!))                        (Jensen again)
           ≤ 1 + Σ_{k=2}^∞ s^k 4^k 2(2σ²)^k k! / (2(k!))                       (Lemma 1.4)
           = 1 + (8sσ²)² Σ_{k=0}^∞ (8sσ²)^k
           ≤ 1 + 128 s²σ⁴                       for |s| ≤ 1/(16σ²)
           ≤ e^{128 s²σ⁴} .

Sub-exponential random variables also give rise to exponential deviation inequalities such as Corollary 1.7 (Chernoff bound) or Theorem 1.9 (Hoeffding's inequality) for weighted sums of independent sub-exponential random variables. The significant difference here is that the larger deviations are controlled by a weaker bound.

Bernstein’s inequality

Theorem 1.13 (Bernstein's inequality). Let X1, . . . , Xn be independent random variables such that IE(Xi) = 0 and Xi ∼ subE(λ). Define

X̄ = (1/n) Σ_{i=1}^n Xi .

Then for any t > 0 we have

IP(X̄ > t) ∨ IP(X̄ < −t) ≤ exp[ −(n/2) ( t²/λ² ∧ t/λ ) ] .

Proof. Without loss of generality, assume that λ = 1 (we can always replaceXi by Xi/λ and t by t/λ). Next, using a Chernoff bound, we get for any s > 0

IP(X > t) ≤n∏i=1

IE[esXi

]e−snt .

Page 32: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.4. Maximal inequalities 25

Next, if |s| ≤ 1, then IE[esXi

]≤ es

2/2 by definition of sub-exponential distri-butions. It yields

IP(X > t) ≤ ens2

2 −snt

Choosing s = 1 ∧ t yields

IP(X > t) ≤ e−n2 (t2∧t)

We obtain the same bound for IP(X < −t) which concludes the proof.

Note that usually, Bernstein’s inequality refers to a slightly more preciseresult that is qualitatively the same as the one above: it exhibits a Gaussiantail e−nt

2/(2λ2) and an exponential tail e−nt/(2λ). See for example Theorem 2.10in [BLM13].

1.4 MAXIMAL INEQUALITIES

The exponential inequalities of the previous section are valid for linear com-binations of independent random variables, and, in particular, for the averageX. In many instances, we will be interested in controlling the maximum overthe parameters of such linear combinations (this is because of empirical riskminimization). The purpose of this section is to present such results.

Maximum over a finite set

We begin by the simplest case possible: the maximum over a finite set.

Theorem 1.14. Let X1, . . . , XN be N random variables such that Xi ∼ subG(σ2).Then

IE[ max1≤i≤N

Xi] ≤ σ√

2 log(N) , and IE[ max1≤i≤N

|Xi|] ≤ σ√

2 log(2N)

Moreover, for any t > 0,

IP(

max1≤i≤N

Xi > t)≤ Ne−

t2

2σ2 , and IP(

max1≤i≤N

|Xi| > t)≤ 2Ne−

t2

2σ2

Note that the random variables in this theorem need not be independent.

Page 33: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.4. Maximal inequalities 26

Proof. For any s > 0,

IE[ max1≤i≤N

Xi] =1

sIE[

log esmax1≤i≤N Xi]

≤ 1

slog IE

[esmax1≤i≤N Xi

](by Jensen)

=1

slog IE

[max

1≤i≤NesXi

]≤ 1

slog

∑1≤i≤N

IE[esXi

]≤ 1

slog

∑1≤i≤N

eσ2s2

2

=logN

s+σ2s

2.

Taking s =√

2(logN)/σ2 yields the first inequality in expectation.The first inequality in probability is obtained by a simple union bound:

IP(

max1≤i≤N

Xi > t)

= IP( ⋃

1≤i≤N

{Xi > t})

≤∑

1≤i≤N

IP(Xi > t)

≤ Ne−t2

2σ2 ,

where we used Lemma 1.3 in the last inequality.The remaining two inequalities follow trivially by noting that

max1≤i≤N

|Xi| = max1≤i≤2N

Xi ,

where XN+i = −Xi for i = 1, . . . , N .

Extending these results to a maximum over an infinite set may be impossi-ble. For example, if one is given an infinite sequence ofiid N (0, σ2) random variables X1, X2, . . ., then for any N ≥ 1, we have for anyt > 0,

IP( max1≤i≤N

Xi < t) = [IP(X1 < t)]N → 0 , N →∞ .

On the opposite side of the picture, if all the Xis are equal to the same randomvariable X, we have for any t > 0,

IP( max1≤i≤N

Xi < t) = IP(X1 < t) > 0 ∀N ≥ 1 .

In the Gaussian case, lower bounds are also available. They illustrate the effectof the correlation between the Xis.

Page 34: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.4. Maximal inequalities 27

Examples from statistics have structure and we encounter many exampleswhere a maximum of random variables over an infinite set is in fact finite. Thisis due to the fact that the random variables that we are considering are notindependent from each other. In the rest of this section, we review some ofthese examples.

Maximum over a convex polytope

We use the definition of a polytope from [Gru03]: a convex polytope P is acompact set with a finite number of vertices V(P) called extreme points. Itsatisfies P = conv(V(P)), where conv(V(P)) denotes the convex hull of thevertices of P.

Let X ∈ IRd be a random vector and consider the (infinite) family of randomvariables

F = {θ>X : θ ∈ P} ,where P ⊂ IRd is a polytope with N vertices. While the family F is infinite, themaximum over F can be reduced to the a finite maximum using the followinguseful lemma.

Lemma 1.15. Consider a linear form x 7→ c>x, x, c ∈ IRd. Then for anyconvex polytope P ⊂ IRd,

maxx∈P

c>x = maxx∈V(P)

c>x

where V(P) denotes the set of vertices of P.

Proof. Assume that V(P) = {v1, . . . , vN}. For any x ∈ P = conv(V(P)), thereexist nonnegative numbers λ1, . . . λN that sum up to 1 and such that x =λ1v1 + · · ·+ λNvN . Thus

c>x = c>( N∑i=1

λivi

)=

N∑i=1

λic>vi ≤

N∑i=1

λi maxx∈V(P)

c>x = maxx∈V(P)

c>x .

It yieldsmaxx∈P

c>x ≤ maxx∈V(P)

c>x ≤ maxx∈P

c>x

so the two quantities are equal.

It immediately yields the following theorem

Theorem 1.16. Let P be a polytope with N vertices v(1), . . . , v(N) ∈ IRd and letX ∈ IRd be a random vector such that [v(i)]>X, i = 1, . . . , N , are sub-Gaussianrandom variables with variance proxy σ2. Then

IE[maxθ∈P

θ>X] ≤ σ√

2 log(N) , and IE[maxθ∈P|θ>X|] ≤ σ

√2 log(2N) .

Moreover, for any t > 0,

IP(

maxθ∈P

θ>X > t)≤ Ne−

t2

2σ2 , and IP(

maxθ∈P|θ>X| > t

)≤ 2Ne−

t2

2σ2

Page 35: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.4. Maximal inequalities 28

Of particular interest are polytopes that have a small number of vertices.A primary example is the `1 ball of IRd defined by

B1 ={x ∈ IRd :

d∑i=1

|xi| ≤ 1}.

Indeed, it has exactly 2d vertices.

Maximum over the `2 ball

Recall that the unit `2 ball of IRd is defined by the set of vectors u that haveEuclidean norm |u|2 at most 1. Formally, it is defined by

B2 ={x ∈ IRd :

d∑i=1

x2i ≤ 1

}.

Clearly, this ball is not a polytope and yet, we can control the maximum ofrandom variables indexed by B2. This is due to the fact that there exists afinite subset of B2 such that the maximum over this finite set is of the sameorder as the maximum over the entire ball.

Definition 1.17. Fix K ⊂ IRd and ε > 0. A set N is called an ε-net of Kwith respect to a distance d(·, ·) on IRd, if N ⊂ K and for any z ∈ K, thereexists x ∈ N such that d(x, z) ≤ ε.

Therefore, if N is an ε-net of K with respect to a norm ‖ · ‖, then everypoint of K is at distance at most ε from a point in N . Clearly, every compactset admits a finite ε-net. The following lemma gives an upper bound on thesize of the smallest ε-net of B2.

Lemma 1.18. Fix ε ∈ (0, 1). Then the unit Euclidean ball B2 has an ε-net Nwith respect to the Euclidean distance of cardinality |N | ≤ (3/ε)d

Proof. Consider the following iterative construction if the ε-net. Choose x1 =0. For any i ≥ 2, take xi to be any x ∈ B2 such that |x− xj |2 > ε for all j < i.If no such x exists, stop the procedure. Clearly, this will create an ε-net. Wenow control its size.

Observe that since |x−y|2 > ε for all x, y ∈ N , the Euclidean balls centeredat x ∈ N and with radius ε/2 are disjoint. Moreover,⋃

z∈N{z +

ε

2B2} ⊂ (1 +

ε

2)B2

where {z + εB2} = {z + εx , x ∈ B2}. Thus, measuring volumes, we get

vol((1 +

ε

2)B2

)≥ vol

( ⋃z∈N{z +

ε

2B2}

)=∑z∈N

vol({z +

ε

2B2}

)

Page 36: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.4. Maximal inequalities 29

This is equivalent to (1 +

ε

2

)d≥ |N |

(ε2

)d.

Therefore, we get the following bound

|N | ≤(

1 +2

ε

)d≤(3

ε

)d.

Theorem 1.19. Let X ∈ IRd be a sub-Gaussian random vector with varianceproxy σ2. Then

IE[maxθ∈B2

θ>X] = IE[maxθ∈B2

|θ>X|] ≤ 4σ√d .

Moreover, for any δ > 0, with probability 1− δ, it holds

maxθ∈B2

θ>X = maxθ∈B2

|θ>X| ≤ 4σ√d+ 2σ

√2 log(1/δ) .

Proof. Let N be a 1/2-net of B2 with respect to the Euclidean norm thatsatisfies |N | ≤ 6d. Next, observe that for every θ ∈ B2, there exists z ∈ N andx such that |x|2 ≤ 1/2 and θ = z + x. Therefore,

maxθ∈B2

θ>X ≤ maxz∈N

z>X + maxx∈ 1

2B2

x>X

But

maxx∈ 1

2B2

x>X =1

2maxx∈B2

x>X

Therefore, using Theorem 1.14, we get

IE[maxθ∈B2

θ>X] ≤ 2IE[maxz∈N

z>X] ≤ 2σ√

2 log(|N |) ≤ 2σ√

2(log 6)d ≤ 4σ√d .

The bound with high probability then follows because

IP(

maxθ∈B2

θ>X > t)≤ IP

(2 maxz∈N

z>X > t)≤ |N |e−

t2

8σ2 ≤ 6de−t2

8σ2 .

To conclude the proof, we find t such that

e−t2

8σ2 +d log(6) ≤ δ ⇔ t2 ≥ 8 log(6)σ2d+ 8σ2 log(1/δ) .

Therefore, it is sufficient to take t =√

8 log(6)σ√d+ 2σ

√2 log(1/δ) .

Page 37: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.5. Sums of independent random matrices 30

1.5 SUMS OF INDEPENDENT RANDOM MATRICES

In this section, we are going to explore how concentration statements can beextended to sums of matrices. As an example, we are going to show a version ofBernstein’s inequality, 1.13, for sums of independent matrices, closely followingthe presentation in [Tro12], which builds on previous work in [AW02]. Resultsof this type have been crucial in providing guarantees for low rank recovery,see for example [Gro11].

In particular, we want to control the maximum eigenvalue of a sum ofindependent random symmetric matrices IP(λmax(

∑i Xi) > t). The tools in-

volved will closely mimic those used for scalar random variables, but with thecaveat that care must be taken when handling exponentials of matrices becauseeA+B 6= eAeB if A,B ∈ IRd and AB 6= BA.

Preliminaries

We denote the set of symmetric, positive semi-definite, and positive definitematrices in IRd×d by Sd,S+

d , and S++d , respectively, and will omit the subscript

when the dimensionality is clear. Here, positive semi-definite means that fora matrix X ∈ S+, v>Xv ≥ 0 for all v ∈ IRd, |v|2 = 1, and positive definitemeans that the equality holds strictly. This is equivalent to all eigenvalues ofX being larger or equal (strictly larger, respectively) than 0, λj(X) ≥ 0 for allj = 1, . . . , d.

The cone of positive definite matrices induces an order on matrices by set-ting A � B if B−A ∈ S+.

Since we want to extend the notion of exponentiating random variables thatwas essential to our derivation of the Chernoff bound to matrices, we will makeuse of the matrix exponential and the matrix logarithm. For a symmetric matrixX, we can define a functional calculus by applying a function f : IR → IR tothe diagonal elements of its spectral decomposition, i.e., if X = QΛQ>, then

f(X) := Qf(Λ)Q>, (1.4)

where[f(Λ)]i,j = f(Λi,j), i, j ∈ [d], (1.5)

is only non-zero on the diagonal and f(X) is well-defined because the spectraldecomposition is unique up to the ordering of the eigenvalues and the basis ofthe eigenspaces, with respect to both of which (1.4) is invariant, as well. Fromthe definition, it is clear that for the spectrum of the resulting matrix,

σ(f(X)) = f(σ(X)). (1.6)

While inequalities between functions not generally carry over to matrices, wehave the following transfer rule:

f(a) ≤ g(a) for a ∈ σ(X) =⇒ f(X) � g(X). (1.7)

Page 38: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.5. Sums of independent random matrices 31

We can now define the matrix exponential by (1.4) which is equivalent todefining it via a power series,

exp(X) =

∞∑k=1

1

k!Xk.

Similarly, we can define the matrix logarithm as the inverse function of exp onS, log(eA) = A, which defines it on S+.

Some nice properties of these functions on matrices are their monotonicity:For A � B,

Tr exp A ≤ Tr exp B, (1.8)

and for 0 ≺ A � B,log A � log B. (1.9)

Note that the analogue of (1.9) is in general not true for the matrix exponential.

The Laplace transform method

For the remainder of this section, let X1, . . . ,Xn ∈ IRd×d be independentrandom symmetric matrices.

As a first step that can lead to different types of matrix concentrationinequalities, we give a generalization of the Chernoff bound for the maximumeigenvalue of a matrix.

Proposition 1.20. Let Y be a random symmetric matrix. Then, for all t ∈ IR,

IP(λmax(Y) ≥ t) ≤ infθ>0{e−θtIE[Tr eθY]}

Proof. We multiply both sides of the inequality λmax(Y) ≥ t by θ, take ex-ponentials, apply the spectral theorem (1.6) and then estimate the maximumeigenvalue by the sum over all eigenvalues, the trace.

IP(λmax(Y) ≥ t) = IP(λmax(θY) ≥ θt)= IP(eλmax(θY) ≥ eθt)≤ e−θtIE[eλmax(θY)]

= e−θtIE[λmax(eθY)]

≤ e−θtIE[Tr(eθY)].

Recall that the crucial step in proving Bernstein’s and Hoeffding’s inequal-ity was to exploit the independence of the summands by the fact that theexponential function turns products into sums.

IE[e∑i θXi ] = IE[

∏i

eθXi ] =∏i

IE[eθXi ].

Page 39: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.5. Sums of independent random matrices 32

This property, eA+B = eAeB , no longer holds true for matrices, unless theycommute.

We could try to replace it with a similar property, the Golden-Thompsoninequality,

Tr[eθ(X1+X2)] ≤ Tr[eθX1eθX2 ].

Unfortunately, this does not generalize to more than two matrices, and whentrying to peel off factors, we would have to pull a maximum eigenvalue out ofthe trace,

Tr[eθX1eθX2 ] ≤ λmax(eθX2)Tr[eθX1 ].

This is the approach followed by Ahlswede-Winter [AW02], which leads toworse constants for concentration than the ones we obtain below.

Instead, we are going to use the following deep theorem due to Lieb [Lie73].A sketch of the proof can be found in the appendix of [Rus02].

Theorem 1.21 (Lieb). Let H ∈ IRd×d be symmetric. Then,

S+d → IR, A 7→ TreH+log A

is a concave function.

Corollary 1.22. Let X,H ∈ Sd be two symmetric matrices such that X israndom and H is fixed. Then,

IE[TreH+X] ≤ TreH+log IE[eX].

Proof. Write Y = eX and use Jensen’s inequality:

IE[TreH+X] = IE[TreH+log Y]

≤ TreH+log(IE[Y])

= TreH+log IE[eX]

With this, we can establish a better bound on the moment generating func-tion of sums of independent matrices.

Lemma 1.23. Let X1, . . . ,Xn be n independent, random symmetric matrices.Then,

IE[Tr exp(∑i

θXi)] ≤ Tr exp(∑i

log IEeθXi)

Proof. Without loss of generality, we can assume θ = 1. Write IEk for theexpectation conditioned on X1, . . . ,Xk, IEk[ . ] = IE[ . |X1, . . . ,Xk]. By the

Page 40: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.5. Sums of independent random matrices 33

tower property,

IE[Tr exp(

n∑i=1

Xi)] = IE0 . . . IEn−1Tr exp(

n∑i=1

Xi)

≤ IE0 . . . IEn−2Tr exp( n−1∑i=1

Xi + log IEn−1eXn︸ ︷︷ ︸

=IEeXn

)...

≤ Tr exp(

n∑i=1

log IE[eXi ]),

where we applied Lieb’s theorem, Theorem 1.21 on the conditional expectationand then used the independence of X1, . . . ,Xn.

Tail bounds for sums of independent matrices

Theorem 1.24 (Master tail bound). For all t ∈ IR,

IP(λmax(∑i

Xi) ≥ t) ≤ infθ>0{e−θtTr exp(

∑i

log IEeθXi)}

Proof. Combine Lemma 1.23 with Proposition 1.20.

Corollary 1.25. Assume that there is a function g : (0,∞)→ [0,∞] and fixedsymmetric matrices Ai such that IE[eθXi ] � eg(θ)Ai for all i. Set

ρ = λmax(∑i

Ak).

Then, for all t ∈ IR,

IP(λmax(∑i

Xi) ≥ t) ≤ d infθ>0{e−θt+g(θ)ρ}.

Proof. Using the operator monoticity of log, (1.9), and the monotonicity ofTr exp, (1.8), we can plug the estimates for the matrix mgfs into the masterinequality, Theorem 1.24, to obtain

IP(λmax(∑i

Xi) ≥ t) ≤ e−θtTr exp(g(θ)∑i

Ai)

≤ de−θtλmax(exp(g(θ)∑i

Ai))

= de−θt exp(g(θ)λmax(∑i

Ai)),

where we estimated Tr(X) =∑dj=1 λj ≤ dλmax and used the spectral theorem.

Page 41: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.5. Sums of independent random matrices 34

Now we are ready to prove Bernstein’s inequality We just need to come upwith a dominating function for Corollary 1.25.

Lemma 1.26. If IE[X] = 0 and λmax(X) ≤ 1, then

IE[eθX] � exp((eθ − θ − 1)IE[X2]).

Proof. Define

f(x) =

eθx − θx− 1

x2, x 6= 0,

θ2

2, x = 0,

and verify that this is a smooth and increasing function on IR. Hence, f(x) ≤f(1) for x ≤ 1. By the transfer rule, (1.7), f(X) � f(I) = f(1)I. Therefore,

eθX = I + θX + Xf(X)X � I + θX + f(1)X2,

and by taking expectations,

IE[eθX] � I + f(1)IE[X2] � exp(f(1)IE[X2]) = exp((eθ − θ − 1)IE[X2]).

Theorem 1.27 (Matrix Bernstein). Assume IEXi = 0 and λmax(Xi) ≤ Ralmost surely for all i and set

σ2 = ‖∑i

IE[X2i ]‖op.

Then, for all t ≥ 0,

IP(λmax(∑i

Xi) ≥ t) ≤ d exp(− σ2

R2h(Rt/σ2))

≤ d exp

(−t2/2

σ2 +Rt/3

)

d exp

(− 3t2

8σ2

), t ≤ σ2

R,

d exp

(− 3t

8R

), t ≥ σ2

R,

where h(u) = (1 + u) log(1 + u)− u, u > 0.

Proof. Without loss of generality, take R = 1. By Lemma 1.26,

IE[eθXi ] � exp(g(θ)IE[X2i ]), with g(θ) = eθ − θ − 1, θ > 0.

By Corollary 1.25,

IP(λmax(∑i

Xi) ≥ t) ≤ d exp(−θt+ g(θ)σ2).

Page 42: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.5. Sums of independent random matrices 35

We can verify that the minimal value is attained at θ = log(1 + t/σ2).

The second inequality then follows from the fact that h(u) ≥ h2(u) = u2/21+u/3

for u ≥ 0 which can be verified by comparing derivatives.The third inequality follows from exploting the properties of h2.

With similar techniques, one can also prove a version of Hoeffding’s inequal-ity for matrices, see [Tro12, Theorem 1.3].

Page 43: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.6. Problem set 36

1.6 PROBLEM SET

Problem 1.1. Let X1, . . . , Xn be independent random variables such thatIE(Xi) = 0 and Xi ∼ subE(λ). For any vector a = (a1, . . . , an)> ∈ IRn, definethe weighted sum

S(a) =

n∑i=1

aiXi ,

Show that for any t > 0 we have

IP(|S(a)| > t) ≤ 2 exp

[−C( t2

λ2|a|22∧ t

λ|a|∞)].

for some positive constant C.

Problem 1.2. A random variable X has χ2n (chi-squared with n degrees of

freedom) if it has the same distribution as Z21 + . . .+Z2

n, where Z1, . . . , Zn arei.i.d. N (0, 1).

(a) Let Z ∼ N (0, 1). Show that the moment generating function of Y =Z2 − 1 satisfies

φ(s) := E[esY]

=

e−s√1− 2s

if s < 1/2

∞ otherwise

(b) Show that for all 0 < s < 1/2,

φ(s) ≤ exp( s2

1− 2s

).

(c) Conclude thatIP(Y > 2t+ 2

√t) ≤ e−t

[Hint: you can use the convexity inequality√

1 + u ≤ 1+u/2].

(d) Show that if X ∼ χ2n, then, with probability at least 1− δ, it holds

X ≤ n+ 2√n log(1/δ) + 2 log(1/δ) .

Problem 1.3. Let X1, X2 . . . be an infinite sequence of sub-Gaussian randomvariables with variance proxy σ2

i = C(log i)−1. Show that for C large enough,we get

IE[

maxi≥2

Xi

]<∞ .

Page 44: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.6. Problem set 37

Problem 1.4. Let A = {Ai,j} 1≤i≤n1≤j≤m

be a random matrix such that its entries

are i.i.d. sub-Gaussian random variables with variance proxy σ2.

(a) Show that the matrix A is sub-Gaussian. What is its variance proxy?

(b) Let ‖A‖ denote the operator norm of A defined by

maxx∈IRm

|Ax|2|x|2

.

Show that there exits a constant C > 0 such that

IE‖A‖ ≤ C(√m+

√n) .

Problem 1.5. Recall that for any q ≥ 1, the `q norm of a vector x ∈ IRn isdefined by

|x|q =( n∑i=1

|xi|q) 1q

.

Let X = (X1, . . . , Xn) be a vector with independent entries such that Xi issub-Gaussian with variance proxy σ2 and IE(Xi) = 0.

(a) Show that for any q ≥ 2, and any x ∈ IRd,

|x|2 ≤ |x|qn12−

1q ,

and prove that the above inequality cannot be improved

(b) Show that for for any q > 1,

IE|X|q ≤ 4σn1q√q

(c) Recover from this bound that

IE max1≤i≤n

|Xi| ≤ 4eσ√

log n .

Problem 1.6. Let K be a compact subset of the unit sphere of IRp thatadmits an ε-net Nε with respect to the Euclidean distance of IRp that satisfies|Nε| ≤ (C/ε)d for all ε ∈ (0, 1). Here C ≥ 1 and d ≤ p are positive constants.Let X ∼ subGp(σ

2) be a centered random vector.Show that there exists positive constants c1 and c2 to be made explicit such

that for any δ ∈ (0, 1), it holds

maxθ∈K

θ>X ≤ c1σ√d log(2p/d) + c2σ

√log(1/δ)

with probability at least 1−δ. Comment on the result in light of Theorem 1.19 .

Page 45: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.6. Problem set 38

Problem 1.7. For any K ⊂ IRd, distance d on IRd and ε > 0, the ε-coveringnumber C(ε) of K is the cardinality of the smallest ε-net of K. The ε-packingnumber P (ε) of K is the cardinality of the largest set P ⊂ K such thatd(z, z′) > ε for all z, z′ ∈ P, z 6= z′. Show that

C(2ε) ≤ P (2ε) ≤ C(ε) .

Problem 1.8. Let X1, . . . , Xn be n independent and random variables suchthat IE[Xi] = µ and var(Xi) ≤ σ2. Fix δ ∈ (0, 1) and assume without loss ofgenerality that n can be factored into n = K · G where G = 8 log(1/δ) is apositive integers.

For g = 1, . . . , G, let Xg denote the average over the gth group of k variables.Formally

Xg =1

k

gk∑i=(g−1)k+1

Xi .

1. Show that for any g = 1, . . . , G,

IP[Xg − µ >

2σ√k

]≤ 1

4.

2. Let µ be defined as the median of {X1, . . . , XG}. Show that

IP[µ− µ > 2σ√

k

]≤ IP

[B ≥ G

2

],

where B ∼ Bin(G, 1/4).

3. Conclude that

IP[µ− µ > 4σ

√2 log(1/δ)

n

]≤ δ

4. Compare this result with Corollary 1.7 and Lemma 1.3. Can you concludethat µ− µ ∼ subG(σ2/n) for some σ2? Conclude.

Problem 1.9. The goal of this problem is to prove the following theorem:

Theorem 1.28 (Johnson-Lindenstrauss Lemma). Given n points X = {x1, . . . , xn}in IRd, let Q ∈ IRk×d be a random projection operator and set P :=

√dkQ.

There is a constant C > 0 such that if

k ≥ C

ε2log n,

P is an ε-isometry for X, i.e.,

(1− ε)‖xi − xj‖22 ≤ ‖Pxi − Pxj‖22 ≤ (1 + ε)‖xi − xj‖22, for all i, j

with propability at least 1− 2 exp(−cε2k).

Page 46: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

1.6. Problem set 39

1. Convince yourself that if d > n, there is a projection P ∈ IRn×d to an ndimensional subspace such that ‖Pxi−Pxj‖2 = ‖xi−xj‖2, i.e., pairwisedistances are exactly preserved.

Let k ≤ d be two integers, Y = (y1, . . . , yd) ∼ N (0, Id×d) independent andidentically distributed Gaussians and Q ∈ IRd×k the projection onto the first kcoordinates, i.e., Qy = (y1, . . . , yk). Define Z = 1

‖Y ‖QY = 1‖Y ‖ (y1, . . . , yk) and

L = ‖Z‖2.

2. Show that IE[L] = kd .

3. Show that for all t > 0 such that 1− 2t(kβ − d) > 0 and 1− 2tβk > 0,

IP

(k∑i=1

y2i ≤ β

k

d

d∑i=1

y2i

)≤ (1− 2t(kβ − d))−k/2(1− 2tβk)−(d−k)/2

(Hint: Show that IE[eλX2

] = 1√1−2λ

for λ < 12 if X ∼ N (0, 1).)

4. Conclude that for β < 1,

IP

(L ≤ β k

d

)≤ exp

(k

2(1− β + log β)

).

5. Similarly, show that for β > 1,

IP

(L ≥ β k

d

)≤ exp

(k

2(1− β + log β)

).

6. Show that for a random projection operator Q ∈ IRk×d and a fixed vectorx ∈ IRd,

a) IE[‖Qx‖2] = kd‖x‖

2.

b) For ε ∈ (0, 1), there is a constant c > 0 such that with probabilityat least 1− 2 exp(−ckε2),

(1− ε)kd‖x‖2 ≤ ‖Qx‖22 ≤ (1 + ε)

k

d‖x‖22.

(Hint: Think about how to apply the previous results in this caseand use the inequalities log(1 − ε) ≤ −ε − ε2/2 and log(1 + ε) ≤ε− ε2/2 + ε3/3.)

7. Prove Theorem 1.28.

Page 47: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

Chapter

2Linear Regression Model

In this chapter, we consider the following regression model:

Yi = f(Xi) + εi, i = 1, . . . , n , (2.1)

where ε = (ε1, . . . , εn)> is sub-Gaussian with variance proxy σ2 and such thatIE[ε] = 0. Our goal is to estimate the function f under a linear assumption.Namely, we assume that x ∈ IRd and f(x) = x>θ∗ for some unknown θ∗ ∈ IRd.

2.1 FIXED DESIGN LINEAR REGRESSION

Depending on the nature of the design points X1, . . . , Xn, we will favor adifferent measure of risk. In particular, we will focus either on fixed or randomdesign.

Random design

The case of random design corresponds to the statistical learning setup. Let(X1, Y1), . . . , (Xn+1, Yn+1) be n + 1 i.i.d. random couples. Given the pairs

(X1, Y1), . . . , (Xn, Yn), the goal is construct a function fn such that fn(Xn+1)

is a good predictor of Yn+1. Note that when fn is constructed, Xn+1 is stillunknown and we have to account for what value it is likely to take.

Consider the following example from [HTF01, Section 3.2]. The responsevariable Y is the log-volume of a cancerous tumor, and the goal is to predictit based on X ∈ IR6, a collection of variables that are easier to measure (ageof patient, log-weight of prostate, . . . ). Here the goal is clearly to construct ffor prediction purposes. Indeed, we want to find an automatic mechanism that

40

Page 48: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.1. Fixed design linear regression 41

outputs a good prediction of the log-weight of the tumor given certain inputsfor a new (unseen) patient.

A natural measure of performance here is the L2-risk employed in the in-troduction:

R(fn) = IE[Yn+1 − fn(Xn+1)]2 = IE[Yn+1 − f(Xn+1)]2 + ‖fn − f‖2L2(PX) ,

where PX denotes the marginal distribution of Xn+1. It measures how goodthe prediction of Yn+1 is in average over realizations of Xn+1. In particular,it does not put much emphasis on values of Xn+1 that are not very likely tooccur.

Note that if the εi are random variables with variance σ2, then one simplyhas R(fn) = σ2 + ‖fn − f‖2L2(PX). Therefore, for random design, we will focus

on the squared L2 norm ‖fn− f‖2L2(PX) as a measure of accuracy. It measures

how close fn is to the unknown f in average over realizations of Xn+1.

Fixed design

In fixed design, the points (or vectors) X1, . . . , Xn are deterministic. To em-phasize this fact, we use lowercase letters x1, . . . , xn to denote fixed design. Ofcourse, we can always think of them as realizations of a random variable butthe distinction between fixed and random design is deeper and significantlyaffects our measure of performance. Indeed, recall that for random design, welook at the performance in average over realizations of Xn+1. Here, there is nosuch thing as a marginal distribution of Xn+1. Rather, since the design pointsx1, . . . , xn are considered deterministic, our goal is estimate f only at thesepoints. This problem is sometimes called denoising since our goal is to recoverf(x1), . . . , f(xn) given noisy observations of these values.

In many instances, fixed designs exhibit particular structures. A typicalexample is the regular design on [0, 1], given by xi = i/n, i = 1, . . . , n. Inter-polation between these points is possible under smoothness assumptions.

Note that in fixed design, we observe µ∗+ε, where µ∗ =(f(x1), . . . , f(xn)

)> ∈IRn and ε = (ε1, . . . , εn)> ∈ IRn is sub-Gaussian with variance proxy σ2. In-stead of a functional estimation problem, it is often simpler to view this problemas a vector problem in IRn. This point of view will allow us to leverage theEuclidean geometry of IRn.

In the case of fixed design, we will focus on the Mean Squared Error (MSE)as a measure of performance. It is defined by

MSE(fn) =1

n

n∑i=1

(fn(xi)− f(xi)

)2.

Equivalently, if we view our problem as a vector problem, it is defined by

MSE(µ) =1

n

n∑i=1

(µi − µ∗i

)2=

1

n|µ− µ∗|22 .

Page 49: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.2. Least squares estimators 42

Often, the design vectors x1, . . . , xn ∈ IRd are stored in an n× d design matrixX, whose jth row is given by x>j . With this notation, the linear regressionmodel can be written as

Y = Xθ∗ + ε , (2.2)

where Y = (Y1, . . . , Yn)> and ε = (ε1, . . . , εn)>. Moreover,

MSE(Xθ) =1

n|X(θ − θ∗)|22 = (θ − θ∗)>X

>Xn

(θ − θ∗) . (2.3)

A natural example of fixed design regression is image denoising. Assumethat µ∗i , i ∈ 1, . . . , n is the grayscale value of pixel i of an image. We do notget to observe the image µ∗ but rather a noisy version of it Y = µ∗ + ε. Givena library of d images {x1, . . . , xd}, xj ∈ IRn, our goal is to recover the originalimage µ∗ using linear combinations of the images x1, . . . , xd. This can be donefairly accurately (see Figure 2.1).

Figure 2.1. Reconstruction of the digit “6”: Original (left), Noisy (middle) and Recon-struction (right). Here n = 16× 16 = 256 pixels. Source [RT11].

As we will see in Remark 2.3, properly choosing the design also ensures thatif MSE(f) is small for some linear estimator f(x) = x>θ, then |θ − θ∗|22 is alsosmall.

In this chapter we only consider the fixed design case.

2.2 LEAST SQUARES ESTIMATORS

Throughout this section, we consider the regression model (2.2) with fixeddesign.

Unconstrained least squares estimator

Define the (unconstrained) least squares estimator θls to be any vector suchthat

θls ∈ argminθ∈IRd

|Y − Xθ|22 .

Page 50: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.2. Least squares estimators 43

Note that we are interested in estimating Xθ∗ and not θ∗ itself, so by exten-sion, we also call µls = Xθls least squares estimator. Observe that µls is theprojection of Y onto the column span of X.

It is not hard to see that least squares estimators of θ∗ and µ∗ = Xθ∗ aremaximum likelihood estimators when ε ∼ N (0, σ2In).

Proposition 2.1. The least squares estimator µls = Xθls ∈ IRn satisfies

X>µls = X>Y .

Moreover, θls can be chosen to be

θls = (X>X)†X>Y ,

where (X>X)† denotes the Moore-Penrose pseudoinverse of X>X.

Proof. The function θ 7→ |Y − Xθ|22 is convex so any of its minima satisfies

∇θ|Y − Xθ|22 = 0,

where ∇θ is the gradient operator. Using matrix calculus, we find

∇θ|Y − Xθ|22 = ∇θ{|Y |22 − 2Y >Xθ + θ>X>Xθ

}= −2(Y >X− θ>X>X)> .

Therefore, solving ∇θ|Y − Xθ|22 = 0 yields

X>Xθ = X>Y .

It concludes the proof of the first statement. The second statement followsfrom the definition of the Moore-Penrose pseudoinverse.

We are now going to prove our first result on the finite sample performanceof the least squares estimator for fixed design.

Theorem 2.2. Assume that the linear model (2.2) holds where ε ∼ subGn(σ2).

Then the least squares estimator θls satisfies

IE[MSE(Xθls)

]=

1

nIE|Xθls − Xθ∗|22 . σ2 r

n,

where r = rank(X>X). Moreover, for any δ > 0, with probability at least 1− δ,it holds

MSE(Xθls) . σ2 r + log(1/δ)

n.

Proof. Note that by definition

|Y − Xθls|22 ≤ |Y − Xθ∗|22 = |ε|22 . (2.4)

Moreover,

|Y − Xθls|22 = |Xθ∗ + ε− Xθls|22 = |Xθls − Xθ∗|22 − 2ε>X(θls − θ∗) + |ε|22 .

Page 51: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.2. Least squares estimators 44

Therefore, we get

|Xθls − Xθ∗|22 ≤ 2ε>X(θls − θ∗) = 2|Xθls − Xθ∗|2ε>X(θls − θ∗)|X(θls − θ∗)|2

(2.5)

Note that it is difficult to control

ε>X(θls − θ∗)|X(θls − θ∗)|2

as θls depends on ε and the dependence structure of this term may be compli-cated. To remove this dependency, a traditional technique is to “sup-out” θls.This is typically where maximal inequalities are needed. Here we have to be abit careful.

Let Φ = [φ1, . . . , φr] ∈ IRn×r be an orthonormal basis of the column span

of X. In particular, there exists ν ∈ IRr such that X(θls − θ∗) = Φν. It yields

ε>X(θls − θ∗)|X(θls − θ∗)|2

=ε>Φν

|Φν|2=ε>Φν

|ν|2= ε>

ν

|ν|2≤ supu∈B2

ε>u ,

where B2 is the unit ball of IRr and ε = Φ>ε. Thus

|Xθls − Xθ∗|22 ≤ 4 supu∈B2

(ε>u)2 ,

Next, let u ∈ Sr−1 and note that |Φu|22 = u>Φ>Φu = u>u = 1, so that for anys ∈ IR, we have

IE[esε>u] = IE[esε

>Φu] ≤ es2σ2

2 .

Therefore, ε ∼ subGr(σ2).

To conclude the bound in expectation, observe that Lemma 1.4 yields

4IE[

supu∈B2

(ε>u)2]

= 4

r∑i=1

IE[ε2i ] ≤ 16σ2r .

Moreover, with probability 1− δ, it follows from the last step in the proof1 ofTheorem 1.19 that

supu∈B2

(ε>u)2 ≤ 8 log(6)σ2r + 8σ2 log(1/δ) .

Remark 2.3. If d ≤ n and B := X>Xn has rank d, then we have

|θls − θ∗|22 ≤MSE(Xθls)λmin(B)

,

and we can use Theorem 2.2 to bound |θls − θ∗|22 directly. By contrast, in thehigh-dimensional case, we will need more structural assumptions to come to asimilar conclusion.

1we could use Theorem 1.19 directly here but at the cost of a factor 2 in the constant.

Page 52: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.2. Least squares estimators 45

Constrained least squares estimator

Let K ⊂ IRd be a symmetric convex set. If we know a priori that θ∗ ∈ K, wemay prefer a constrained least squares estimator θlsK defined by

θlsK ∈ argminθ∈K

|Y − Xθ|22 .

The fundamental inequality (2.4) would still hold and the bounds on the MSEmay be smaller. Indeed, (2.5) can be replaced by

|XθlsK − Xθ∗|22 ≤ 2ε>X(θlsK − θ∗) ≤ 2 supθ∈K−K

(ε>Xθ) ,

where K −K = {x − y : x, y ∈ K}. It is easy to see that if K is symmetric(x ∈ K ⇔ −x ∈ K) and convex, then K −K = 2K so that

2 supθ∈K−K

(ε>Xθ) = 4 supv∈XK

(ε>v)

where XK = {Xθ : θ ∈ K} ⊂ IRn. This is a measure of the size (width) ofXK. If ε ∼ N (0, Id), the expected value of the above supremum is actuallycalled Gaussian width of XK. While ε is not Gaussian but sub-Gaussian here,similar properties will still hold.

`1 constrained least squares

Assume here that K = B1 is the unit `1 ball of IRd. Recall that it is defined by

B1 ={x ∈ IRd :

d∑i=1

|xi| ≤ 1},

and it has exactly 2d vertices V = {e1,−e1, . . . , ed,−ed}, where ej is the j-thvector of the canonical basis of IRd and is defined by

ej = (0, . . . , 0, 1︸︷︷︸jth position

, 0, . . . , 0)> .

It implies that the set XK = {Xθ, θ ∈ K} ⊂ IRn is also a polytope with atmost 2d vertices that are in the set XV = {X1,−X1, . . . ,Xd,−Xd} where Xj isthe j-th column of X. Indeed, XK is a obtained by rescaling and embedding(resp. projecting) the polytope K when d ≤ n (resp., d ≥ n). Note that somecolumns of X might not be vertices of XK so that XV might be a strict supersetof the set of vertices of XK.

Theorem 2.4. Let K = B1 be the unit `1 ball of IRd, d ≥ 2 and assume thatθ∗ ∈ B1. Moreover, assume the conditions of Theorem 2.2 and that the columnsof X are normalized in such a way that maxj |Xj |2 ≤

√n. Then the constrained

least squares estimator θlsB1satisfies

IE[MSE(XθlsB1

)]

=1

nIE|XθlsB1

− Xθ∗|22 . σ

√log d

n,

Page 53: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.2. Least squares estimators 46

Moreover, for any δ > 0, with probability 1− δ, it holds

MSE(XθlsB1) . σ

√log(d/δ)

n.

Proof. From the considerations preceding the theorem, we get that

|XθlsB1− Xθ∗|22 ≤ 4 sup

v∈XK(ε>v).

Observe now that since ε ∼ subGn(σ2), for any column Xj such that |Xj |2 ≤√n, the random variable ε>Xj ∼ subG(nσ2). Therefore, applying Theo-

rem 1.16, we get the bound on IE[MSE(XθlsK)

]and for any t ≥ 0,

IP[MSE(XθlsK) > t

]≤ IP

[supv∈XK

(ε>v) > nt/4]≤ 2de−

nt2

32σ2

To conclude the proof, we find t such that

2de−nt2

32σ2 ≤ δ ⇔ t2 ≥ 32σ2 log(2d)

n+ 32σ2 log(1/δ)

n.

Note that the proof of Theorem 2.2 also applies to θlsB1(exercise!) so that

XθlsB1benefits from the best of both rates,

IE[MSE(XθlsB1

)]. min

(σ2 r

n, σ

√log d

n

).

This is called an elbow effect. The elbow takes place around r '√n (up to

logarithmic terms).

`0 constrained least squares

By an abuse of notation, we call the number of non-zero coefficients of a vectorθ ∈ IRd its `0 norm (it is not actually a norm). It is denoted by

|θ|0 =

d∑j=1

1I(θj 6= 0) .

We call a vector θ with “small” `0 norm a sparse vector. More precisely, if|θ|0 ≤ k, we say that θ is a k-sparse vector. We also call support of θ the set

supp(θ) ={j ∈ {1, . . . , d} : θj 6= 0

}so that |θ|0 = card(supp(θ)) =: | supp(θ)| .

Page 54: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.2. Least squares estimators 47

Remark 2.5. The `0 terminology and notation comes from the fact that

limq→0+

d∑j=1

|θj |q = |θ|0

Therefore it is really limq→0+ |θ|qq but the notation |θ|00 suggests too much thatit is always equal to 1.

By extension, denote by B0(k) the `0 ball of IRd, i.e., the set of k-sparsevectors, defined by

B0(k) = {θ ∈ IRd : |θ|0 ≤ k} .

In this section, our goal is to control the MSE of θlsK when K = B0(k). Note that

computing θlsB0(k) essentially requires computing(dk

)least squares estimators,

which is an exponential number in k. In practice this will be hard (or evenimpossible) but it is interesting to understand the statistical properties of thisestimator and to use them as a benchmark.

Theorem 2.6. Fix a positive integer k ≤ d/2. Let K = B0(k) be set ofk-sparse vectors of IRd and assume that θ∗ ∈ B0(k). Moreover, assume theconditions of Theorem 2.2. Then, for any δ > 0, with probability 1− δ, it holds

MSE(XθlsB0(k)) .σ2

nlog

(d

2k

)+σ2k

n+σ2

nlog(1/δ) .

Proof. We begin as in the proof of Theorem 2.2 to get (2.5):

|XθlsK − Xθ∗|22 ≤ 2ε>X(θlsK − θ∗) = 2|XθlsK − Xθ∗|2ε>X(θlsK − θ∗)|X(θlsK − θ∗)|2

.

We know that both θlsK and θ∗ are in B0(k) so that θlsK − θ∗ ∈ B0(2k). Forany S ⊂ {1, . . . , d}, let XS denote the n× |S| submatrix of X that is obtainedfrom the columns of Xj , j ∈ S of X. Denote by rS ≤ |S| the rank of XS andlet ΦS = [φ1, . . . , φrS ] ∈ IRn×rS be an orthonormal basis of the column spanof XS . Moreover, for any θ ∈ IRd, define θ(S) ∈ IR|S| to be the vector with

coordinates θj , j ∈ S. If we denote by S = supp(θlsK − θ∗), we have |S| ≤ 2kand there exists ν ∈ IRrS such that

X(θlsK − θ∗) = XS(θlsK(S)− θ∗(S)) = ΦSν .

Therefore,

ε>X(θlsK − θ∗)|X(θlsK − θ∗)|2

=ε>ΦSν

|ν|2≤ max|S|=2k

supu∈BrS2

[ε>ΦS ]u

where BrS2 is the unit ball of IRrS . It yields

|XθlsK − Xθ∗|22 ≤ 4 max|S|=2k

supu∈BrS2

(ε>S u)2 ,

Page 55: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.2. Least squares estimators 48

with εS = Φ>S ε ∼ subGrS (σ2).Using a union bound, we get for any t > 0,

IP(

max|S|=2k

supu∈BrS2

(ε>u)2 > t)≤∑|S|=2k

IP(

supu∈BrS2

(ε>u)2 > t)

It follows from the proof of Theorem 1.19 that for any |S| ≤ 2k,

IP(

supu∈BrS2

(ε>u)2 > t)≤ 6|S|e−

t8σ2 ≤ 62ke−

t8σ2 .

Together, the above three displays yield

IP(|XθlsK − Xθ∗|22 > 4t) ≤(d

2k

)62ke−

t8σ2 . (2.6)

To ensure that the right-hand side of the above inequality is bounded by δ, weneed

t ≥ Cσ2{

log

(d

2k

)+ k log(6) + log(1/δ)

}.

How large is log(d2k

)? It turns out that it is not much larger than k.

Lemma 2.7. For any integers 1 ≤ k ≤ n, it holds(n

k

)≤(enk

)k.

Proof. Observe first that if k = 1, since n ≥ 1, it holds,(n

1

)= n ≤ en =

(en1

)1

Next, we proceed by induction and assume that it holds for some k ≤ n − 1that (

n

k

)≤(enk

)k.

Observe that(n

k + 1

)=

(n

k

)n− kk + 1

≤(enk

)k n− kk + 1

≤ eknk+1

(k + 1)k+1

(1 +

1

k

)k,

where we used the induction hypothesis in the first inequality. To conclude, itsuffices to observe that (

1 +1

k

)k≤ e.

It immediately leads to the following corollary:

Page 56: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.3. The Gaussian Sequence Model 49

Corollary 2.8. Under the assumptions of Theorem 2.6, for any δ > 0, withprobability at least 1− δ, it holds

MSE(XθlsB0(k)) .σ2k

nlog( ed

2k

)+σ2k

nlog(6) +

σ2

nlog(1/δ) .

Note that for any fixed δ, there exits a constant Cδ > 0 such that for anyn ≥ 2k, with high probability,

MSE(XθlsB0(k)) ≤ Cδσ2k

nlog( ed

2k

).

Comparing this result with Theorem 2.2 with r = k, we see that the price topay for not knowing the support of θ∗ but only its size, is a logarithmic factorin the dimension d.

This result immediately leads the following bound in expectation.

Corollary 2.9. Under the assumptions of Theorem 2.6,

IE[MSE(XθlsB0(k))

].σ2k

nlog(edk

).

Proof. It follows from (2.6) that for any H ≥ 0,

IE[MSE(XθlsB0(k))

]=

∫ ∞0

IP(|XθlsK − Xθ∗|22 > nu)du

≤ H +

∫ ∞0

IP(|XθlsK − Xθ∗|22 > n(u+H))du

≤ H +

2k∑j=1

(d

j

)62k

∫ ∞0

e−n(u+H)

32σ2 du

= H +

2k∑j=1

(d

j

)62ke−

nH32σ2

32σ2

n.

Next, take H to be such that

2k∑j=1

(d

j

)62ke−

nH32σ2 = 1 .

This yields

H .σ2k

nlog(edk

),

which completes the proof.

2.3 THE GAUSSIAN SEQUENCE MODEL

The Gaussian Sequence Model is a toy model that has received a lot ofattention, mostly in the eighties. The main reason for its popularity is that

Page 57: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.3. The Gaussian Sequence Model 50

it carries already most of the insight of nonparametric estimation. While themodel looks very simple it allows to carry deep ideas that extend beyond itsframework and in particular to the linear regression model that we are inter-ested in. Unfortunately we will only cover a small part of these ideas andthe interested reader should definitely look at the excellent books by A. Tsy-bakov [Tsy09, Chapter 3] and I. Johnstone [Joh11].

The model is as follows:

Yi = θ∗i + εi , i = 1, . . . , d , (2.7)

where ε1, . . . , εd are i.i.d. N (0, σ2) random variables. Note that often, d istaken equal to ∞ in this sequence model and we will also discuss this case. Itslinks to nonparametric estimation will become clearer in Chapter 3. The goalhere is to estimate the unknown vector θ∗.

The sub-Gaussian Sequence Model

Note first that the model (2.7) is a special case of the linear model with fixeddesign (2.1) with n = d and f(xi) = x>i θ

∗, where x1, . . . , xn form the canonicalbasis e1, . . . , en of IRn and ε has a Gaussian distribution. Therefore, n = d isboth the dimension of the parameter θ and the number of observations and itlooks like we have chosen to index this problem by d rather than n somewhatarbitrarily. We can bring n back into the picture, by observing that this modelencompasses slightly more general choices for the design matrix X as long asit satisfies the following assumption.

Assumption ORT The design matrix satisfies

X>Xn

= Id ,

where Id denotes the identity matrix of IRd.

Assumption ORT allows for cases where d ≤ n but not d > n (high dimensionalcase) because of obvious rank constraints. In particular, it means that the dcolumns of X are orthogonal in IRn and all have norm

√n.

Under this assumption, it follows from the linear regression model (2.2)that

y :=1

nX>Y =

X>Xn

θ∗ +1

nX>ε

= θ∗ + ξ ,

where ξ = (ξ1, . . . , ξd) ∼ subGd(σ2/n) (If ε is Gaussian, we even have ξ ∼

Nd(0, σ2/n)). As a result, under the assumption ORT, when ξ is Gaussian, thelinear regression model (2.2) is equivalent to the Gaussian Sequence Model (2.7)up to a transformation of the data Y and a rescaling of the variance. Moreover,for any estimator θ ∈ IRd, under ORT, it follows from (2.3) that

MSE(Xθ) = (θ − θ∗)>X>Xn

(θ − θ∗) = |θ − θ∗|22 .

Page 58: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.3. The Gaussian Sequence Model 51

Furthermore, for any θ ∈ IRd, the assumption ORT yields,

|y − θ|22 = | 1nX>Y − θ|22

= |θ|22 −2

nθ>X>Y +

1

n2Y >XX>Y

=1

n|Xθ|22 −

2

n(Xθ)>Y +

1

n|Y |22 +Q

=1

n|Y − Xθ|22 +Q , (2.8)

where Q is a constant that does not depend on θ and is defined by

Q =1

n2Y >XX>Y − 1

n|Y |22

This implies in particular that the least squares estimator θls is equal to y.

We introduce a sightly more general model called sub-Gaussian sequencemodel :

y = θ∗ + ξ ∈ IRd , (2.9)

where ξ ∼ subGd(σ2/n).

In this section, we can actually completely “forget” about our originalmodel (2.2). In particular we can define this model independently of Assump-tion ORT and thus for any values of n and d.

The sub-Gaussian sequence model and the Gaussian sequence model arecalled direct (observation) problems as opposed to inverse problems where thegoal is to estimate the parameter θ∗ only from noisy observations of its imagethrough an operator. The linear regression model is one such inverse problemwhere the matrix X plays the role of a linear operator. However, in these notes,we never try to invert the operator. See [Cav11] for an interesting survey onthe statistical theory of inverse problems.

Sparsity adaptive thresholding estimators

If we knew a priori that θ was k-sparse, we could directly employ Corollary 2.8to obtain that with probability at least 1− δ, we have

MSE(XθlsB0(k)) ≤ Cδσ2k

nlog( ed

2k

).

As we will see, the assumption ORT gives us the luxury to not know k and yetadapt to its value. Adaptation means that we can construct an estimator thatdoes not require the knowledge of k (the smallest such that |θ∗|0 ≤ k) and yet

perform as well as θlsB0(k), up to a multiplicative constant.Let us begin with some heuristic considerations to gain some intuition.

Assume the sub-Gaussian sequence model (2.9). If nothing is known about θ∗

Page 59: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.3. The Gaussian Sequence Model 52

it is natural to estimate it using the least squares estimator θls = y. In thiscase,

MSE(Xθls) = |y − θ∗|22 = |ξ|22 ≤ Cδσ2d

n,

where the last inequality holds with probability at least 1− δ. This is actuallywhat we are looking for if k = Cd for some positive constant C ≤ 1. Theproblem with this approach is that it does not use the fact that k may be muchsmaller than d, which happens when θ∗ has many zero coordinates.

If θ∗j = 0, then, yj = ξj , which is a sub-Gaussian random variable with vari-

ance proxy σ2/n. In particular, we know from Lemma 1.3 that with probabilityat least 1− δ,

|ξj | ≤ σ√

2 log(2/δ)

n= τ . (2.10)

The consequences of this inequality are interesting. One the one hand, if weobserve |yj | � τ , then it must correspond to θ∗j 6= 0. On the other hand, if|yj | ≤ τ is smaller, then, θ∗j cannot be very large. In particular, by the triangleinequality, |θ∗j | ≤ |yj |+ |ξj | ≤ 2τ . Therefore, we loose at most 2τ by choosing

θj = 0. It leads us to consider the following estimator.

Definition 2.10. The hard thresholding estimator with threshold 2τ > 0is denoted by θhrd and has coordinates

θhrdj =

{yj if |yj | > 2τ ,0 if |yj | ≤ 2τ ,

for j = 1, . . . , d. In short, we can write θhrdj = yj1I(|yj | > 2τ).

From our above consideration, we are tempted to choose τ as in (2.10).Yet, this threshold is not large enough. Indeed, we need to choose τ such that|ξj | ≤ τ simultaneously for all j. This can be done using a maximal inequality.Namely, Theorem 1.14 ensures that with probability at least 1− δ,

max1≤j≤d

|ξj | ≤ σ√

2 log(2d/δ)

n.

It yields the following theorem.

Theorem 2.11. Consider the linear regression model (2.2) under the assump-tion ORT or, equivalenty, the sub-Gaussian sequence model (2.9). Then the

hard thresholding estimator θhrd with threshold

2τ = 2σ

√2 log(2d/δ)

n(2.11)

enjoys the following two properties on the same event A such that IP(A) ≥ 1−δ:

(i) If |θ∗|0 = k,

MSE(Xθhrd) = |θhrd − θ∗|22 . σ2 k log(2d/δ)

n.

Page 60: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.3. The Gaussian Sequence Model 53

(ii) if minj∈supp(θ∗) |θ∗j | > 3τ , then

supp(θhrd) = supp(θ∗) .

Proof. Define the event

A ={

maxj|ξj | ≤ τ

},

and recall that Theorem 1.14 yields IP(A) ≥ 1 − δ. On the event A, thefollowing holds for any j = 1, . . . , d.

First, observe that

|yj | > 2τ ⇒ |θ∗j | ≥ |yj | − |ξj | > τ (2.12)

and|yj | ≤ 2τ ⇒ |θ∗j | ≤ |yj |+ |ξj | ≤ 3τ. (2.13)

It yields

|θhrdj − θ∗j | = |yj − θ∗j |1I(|yj | > 2τ) + |θ∗j |1I(|yj | ≤ 2τ)

≤ τ1I(|yj | > 2τ) + |θ∗j |1I(|yj | ≤ 2τ)

≤ τ1I(|θ∗j | > τ) + |θ∗j |1I(|θ∗j | ≤ 3τ) by (2.12) and (2.13)

≤ 4 min(|θ∗j |, τ)

It yields

|θhrd − θ∗|22 =

d∑j=1

|θhrdj − θ∗j |2 ≤ 16

d∑j=1

min(|θ∗j |2, τ2) ≤ 16|θ∗|0τ2 .

This completes the proof of (i).To prove (ii), note that if θ∗j 6= 0, then |θ∗j | > 3τ so that

|yj | = |θ∗j + ξj | > 3τ − τ = 2τ .

Therefore, θhrdj 6= 0 so that supp(θ∗) ⊂ supp(θhrd).

Next, if θhrdj 6= 0, then |θhrdj | = |yj | > 2τ . It yields

|θ∗j | ≥ |yj | − τ > τ.

Therefore, |θ∗j | 6= 0 and supp(θhrd) ⊂ supp(θ∗).

Similar results can be obtained for the soft thresholding estimator θsft

defined by

θsftj =

yj − 2τ if yj > 2τ ,yj + 2τ if yj < −2τ ,0 if |yj | ≤ 2τ ,

In short, we can write

θsftj =(

1− 2τ

|yj |

)+yj

Page 61: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.4. High-dimensional linear regression 54

−2 −1 0 1 2

−2

−1

01

2

Hard

x

y

−2 −1 0 1 2

−2

−1

01

2

Soft

x

yFigure 2.2. Transformation applied to yj with 2τ = 1 to obtain the hard (left) and soft(right) thresholding estimators

2.4 HIGH-DIMENSIONAL LINEAR REGRESSION

The BIC and Lasso estimators

It can be shown (see Problem 2.6) that the hard and soft thresholding es-timators are solutions of the following penalized empirical risk minimizationproblems:

θhrd = argminθ∈IRd

{|y − θ|22 + 4τ2|θ|0

}θsft = argmin

θ∈IRd

{|y − θ|22 + 4τ |θ|1

}In view of (2.8), under the assumption ORT, the above variational definitionscan be written as

θhrd = argminθ∈IRd

{ 1

n|Y − Xθ|22 + 4τ2|θ|0

}θsft = argmin

θ∈IRd

{ 1

n|Y − Xθ|22 + 4τ |θ|1

}

When the assumption ORT is not satisfied, they no longer correspond to thresh-olding estimators but can still be defined as above. We change the constant inthe threshold parameters for future convenience.

Definition 2.12. Fix τ > 0 and assume the linear regression model (2.2). The

Page 62: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.4. High-dimensional linear regression 55

BIC2 estimator of θ∗ is defined by any θbic such that

θbic ∈ argminθ∈IRd

{ 1

n|Y − Xθ|22 + τ2|θ|0

}.

Moreover, the Lasso estimator of θ∗ is defined by any θL such that

θL ∈ argminθ∈IRd

{ 1

n|Y − Xθ|22 + 2τ |θ|1

}.

Remark 2.13 (Numerical considerations). Computing the BIC estimator canbe proved to be NP-hard in the worst case. In particular, no computationalmethod is known to be significantly faster than the brute force search amongall 2d sparsity patterns. Indeed, we can rewrite:

minθ∈IRd

{ 1

n|Y − Xθ|22 + τ2|θ|0

}= min

0≤k≤d

{min

θ : |θ|0=k

1

n|Y − Xθ|22 + τ2k

}To compute minθ : |θ|0=k

1n |Y − Xθ|22, we need to compute

(dk

)least squares

estimators on a space of size k. Each costs O(k3) (matrix inversion). Thereforethe total cost of the brute force search is

C

d∑k=0

(d

k

)k3 = Cd32d .

By contrast, computing the Lasso estimator is a convex problem and thereexist many efficient algorithms to solve it. We will not describe this optimiza-tion problem in details but only highlight a few of the best known algorithms:

1. Probably the most popular method among statisticians relies on coor-dinate gradient descent. It is implemented in the glmnet package in R[FHT10],

2. An interesting method called LARS [EHJT04] computes the entire reg-ularization path, i.e., the solution of the convex problem for all valuesof τ . It relies on the fact that, as a function of τ , the solution θL is apiecewise linear function (with values in IRd). Yet this method provedto be too slow for very large problems and has been replaced by glmnet

which computes solutions for values of τ on a grid much faster.

3. The optimization community has made interesting contributions to thisfield by using proximal methods to solve this problem. These methodsexploit the structure of the objective function which is of the form smooth(sum of squares) + simple (`1 norm). A good entry point to this literatureis perhaps the FISTA algorithm [BT09].

2Note that it minimizes the Bayes Information Criterion (BIC) employed in the tradi-

tional literature of asymptotic statistics if τ =√

log(d)/n. We will use the same value below,up to multiplicative constants (it’s the price to pay to get non asymptotic results).

Page 63: High Dimensional Statistics - MIT Mathematicsrigollet/PDFs/RigNotes17.pdf · high-dimensional statistics would not have been possible without standing on the shoulders of giants

2.4. High-dimensional linear regression 56

4. Recently there has been a lot of interest around this objective for verylarge d and very large n. In this case, even computing |Y − Xθ|22 maybe computationally expensive and solutions based on stochastic gradientdescent are flourishing.

Note that by Lagrange duality, computing θL is equivalent to solving an`1 constrained least squares. Nevertheless, the radius of the `1 constraint isunknown. In general it is hard to relate Lagrange multipliers to the size con-straints. The name “Lasso” was originally given to the constrained version thisestimator in the original paper of Robert Tibshirani [Tib96].

Analysis of the BIC estimator

While computationally hard to implement, the BIC estimator gives us a goodbenchmark for sparse estimation. Its performance is similar to that of θhrd butwithout requiring the assumption ORT.

Theorem 2.14. Assume that the linear model (2.2) holds, where $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$ and that $|\theta^*|_0\ge1$. Then the BIC estimator $\hat\theta^{\mathrm{bic}}$ with regularization parameter
\[
\tau^2 = 16\log(6)\,\frac{\sigma^2}{n} + 32\,\frac{\sigma^2\log(ed)}{n} \qquad (2.14)
\]
satisfies
\[
\mathrm{MSE}(X\hat\theta^{\mathrm{bic}}) = \frac1n|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2^2 \;\lesssim\; \frac{|\theta^*|_0\,\sigma^2\log(ed/\delta)}{n} \qquad (2.15)
\]
with probability at least $1-\delta$.

Proof. We begin as usual by noting that
\[
\frac1n|Y-X\hat\theta^{\mathrm{bic}}|_2^2 + \tau^2|\hat\theta^{\mathrm{bic}}|_0 \le \frac1n|Y-X\theta^*|_2^2 + \tau^2|\theta^*|_0 .
\]
It implies
\[
|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2^2 \le n\tau^2|\theta^*|_0 + 2\varepsilon^\top X(\hat\theta^{\mathrm{bic}}-\theta^*) - n\tau^2|\hat\theta^{\mathrm{bic}}|_0 .
\]
First, note that
\[
2\varepsilon^\top X(\hat\theta^{\mathrm{bic}}-\theta^*) = 2\varepsilon^\top\Big(\frac{X\hat\theta^{\mathrm{bic}}-X\theta^*}{|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2}\Big)|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2
\le 2\Big[\varepsilon^\top\Big(\frac{X\hat\theta^{\mathrm{bic}}-X\theta^*}{|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2}\Big)\Big]^2 + \frac12|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2^2 ,
\]
where we used the inequality $2ab\le 2a^2+\frac12 b^2$. Together with the previous display, it yields
\[
|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2^2 \le 2n\tau^2|\theta^*|_0 + 4\big[\varepsilon^\top U(\hat\theta^{\mathrm{bic}}-\theta^*)\big]^2 - 2n\tau^2|\hat\theta^{\mathrm{bic}}|_0 , \qquad (2.16)
\]
where $U(z)=z/|z|_2$.

Next, we need to "sup out" $\hat\theta^{\mathrm{bic}}$. To that end, we decompose the sup into a max over cardinalities as follows:
\[
\sup_{\theta\in\mathbb{R}^d} = \max_{1\le k\le d}\,\max_{|S|=k}\,\sup_{\mathrm{supp}(\theta)=S} .
\]
Applied to the above inequality, it yields
\begin{align*}
4\big[\varepsilon^\top U(\hat\theta^{\mathrm{bic}}-\theta^*)\big]^2 - 2n\tau^2|\hat\theta^{\mathrm{bic}}|_0
&\le \max_{1\le k\le d}\Big\{\max_{|S|=k}\,\sup_{\mathrm{supp}(\theta)=S} 4\big[\varepsilon^\top U(\theta-\theta^*)\big]^2 - 2n\tau^2 k\Big\}\\
&\le \max_{1\le k\le d}\Big\{\max_{|S|=k}\,\sup_{u\in\mathcal B_2^{r_{S,*}}} 4\big[\varepsilon^\top\Phi_{S,*}u\big]^2 - 2n\tau^2 k\Big\} ,
\end{align*}
where $\Phi_{S,*}=[\phi_1,\dots,\phi_{r_{S,*}}]$ is an orthonormal basis of the span of the columns $\{X_j,\ j\in S\cup\mathrm{supp}(\theta^*)\}$ of $X$ and $r_{S,*}\le|S|+|\theta^*|_0$ is the dimension of this column span.

Using union bounds, we get for any $t>0$,
\[
\mathbb{P}\Big(\max_{1\le k\le d}\Big\{\max_{|S|=k}\,\sup_{u\in\mathcal B_2^{r_{S,*}}}4\big[\varepsilon^\top\Phi_{S,*}u\big]^2 - 2n\tau^2 k\Big\}\ge t\Big)
\le \sum_{k=1}^d\sum_{|S|=k}\mathbb{P}\Big(\sup_{u\in\mathcal B_2^{r_{S,*}}}\big[\varepsilon^\top\Phi_{S,*}u\big]^2 \ge \frac t4 + \frac12 n\tau^2 k\Big) .
\]
Moreover, using the $\varepsilon$-net argument from Theorem 1.19, we get for $|S|=k$,
\begin{align*}
\mathbb{P}\Big(\sup_{u\in\mathcal B_2^{r_{S,*}}}\big[\varepsilon^\top\Phi_{S,*}u\big]^2 \ge \frac t4 + \frac12 n\tau^2 k\Big)
&\le 2\cdot 6^{r_{S,*}}\exp\Big(-\frac{\frac t4+\frac12 n\tau^2 k}{8\sigma^2}\Big)\\
&\le 2\exp\Big(-\frac{t}{32\sigma^2} - \frac{n\tau^2 k}{16\sigma^2} + (k+|\theta^*|_0)\log(6)\Big)\\
&\le \exp\Big(-\frac{t}{32\sigma^2} - 2k\log(ed) + |\theta^*|_0\log(12)\Big) ,
\end{align*}
where, in the last inequality, we used the definition (2.14) of $\tau$.

Putting everything together, we get
\begin{align*}
\mathbb{P}\big(|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2^2 \ge 2n\tau^2|\theta^*|_0 + t\big)
&\le \sum_{k=1}^d\sum_{|S|=k}\exp\Big(-\frac{t}{32\sigma^2} - 2k\log(ed) + |\theta^*|_0\log(12)\Big)\\
&= \sum_{k=1}^d\binom dk\exp\Big(-\frac{t}{32\sigma^2} - 2k\log(ed) + |\theta^*|_0\log(12)\Big)\\
&\le \sum_{k=1}^d\exp\Big(-\frac{t}{32\sigma^2} - k\log(ed) + |\theta^*|_0\log(12)\Big) \quad\text{(by Lemma 2.7)}\\
&= \sum_{k=1}^d (ed)^{-k}\exp\Big(-\frac{t}{32\sigma^2} + |\theta^*|_0\log(12)\Big)\\
&\le \exp\Big(-\frac{t}{32\sigma^2} + |\theta^*|_0\log(12)\Big) .
\end{align*}
To conclude the proof, choose $t = 32\sigma^2|\theta^*|_0\log(12) + 32\sigma^2\log(1/\delta)$ and observe that, combined with (2.16), it yields with probability $1-\delta$,
\begin{align*}
|X\hat\theta^{\mathrm{bic}}-X\theta^*|_2^2 &\le 2n\tau^2|\theta^*|_0 + t\\
&\le 64\sigma^2\log(ed)|\theta^*|_0 + 64\log(12)\sigma^2|\theta^*|_0 + 32\sigma^2\log(1/\delta)\\
&\le 224\,|\theta^*|_0\sigma^2\log(ed) + 32\sigma^2\log(1/\delta) .
\end{align*}

It follows from Theorem 2.14 that $\hat\theta^{\mathrm{bic}}$ adapts to the unknown sparsity of $\theta^*$, just like $\hat\theta^{\mathrm{hrd}}$. Moreover, this holds under no assumption on the design matrix $X$.
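Since the BIC estimator is computationally hard to compute exactly, the following brute-force sketch (not from the notes) makes that concrete: it searches all $2^d$ supports and is therefore only feasible for very small $d$. The data-generating choices are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def bic_estimator(X, Y, tau2):
    """Brute-force BIC: minimize (1/n)|Y - X theta|_2^2 + tau2*|theta|_0 over all supports."""
    n, d = X.shape
    best_val, best_theta = np.mean(Y ** 2), np.zeros(d)      # empty support
    for k in range(1, d + 1):
        for S in combinations(range(d), k):
            XS = X[:, S]
            theta_S, *_ = np.linalg.lstsq(XS, Y, rcond=None)  # least squares on support S
            val = np.mean((Y - XS @ theta_S) ** 2) + tau2 * k
            if val < best_val:
                best_val = val
                best_theta = np.zeros(d)
                best_theta[list(S)] = theta_S
    return best_theta

# Illustrative run with a tiny d (assumed values).
rng = np.random.default_rng(1)
n, d, sigma = 50, 8, 1.0
X = rng.standard_normal((n, d))
theta_star = np.zeros(d); theta_star[[0, 3]] = [2.0, -1.5]
Y = X @ theta_star + sigma * rng.standard_normal(n)
tau2 = 16 * np.log(6) * sigma**2 / n + 32 * sigma**2 * np.log(np.e * d) / n  # as in (2.14)
print(np.round(bic_estimator(X, Y, tau2), 2))
```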

Analysis of the Lasso estimator

Slow rate for the Lasso estimator

The properties of the BIC estimator are quite impressive. It shows that under no assumption on $X$, one can mimic two oracles: (i) the oracle that knows the support of $\theta^*$ (and computes least squares on this support), up to a $\log(ed)$ term, and (ii) the oracle that knows the sparsity $|\theta^*|_0$ of $\theta^*$, up to a smaller logarithmic term $\log(ed/|\theta^*|_0)$, which is subsumed by $\log(ed)$. Actually, the $\log(ed)$ can even be improved to $\log(ed/|\theta^*|_0)$ by using a modified BIC estimator (see Problem 2.7).

The Lasso estimator is a bit more difficult to analyze because, by construction, it should more naturally adapt to the unknown $\ell_1$-norm of $\theta^*$. This can be easily shown as in the next theorem, analogous to Theorem 2.4.


Theorem 2.15. Assume that the linear model (2.2) holds where $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$. Moreover, assume that the columns of $X$ are normalized in such a way that $\max_j|X_j|_2\le\sqrt n$. Then the Lasso estimator $\hat\theta^{\mathrm L}$ with regularization parameter
\[
2\tau = 2\sigma\sqrt{\frac{2\log(2d)}{n}} + 2\sigma\sqrt{\frac{2\log(1/\delta)}{n}} \qquad (2.17)
\]
satisfies
\[
\mathrm{MSE}(X\hat\theta^{\mathrm L}) = \frac1n|X\hat\theta^{\mathrm L}-X\theta^*|_2^2 \le 4|\theta^*|_1\sigma\sqrt{\frac{2\log(2d)}{n}} + 4|\theta^*|_1\sigma\sqrt{\frac{2\log(1/\delta)}{n}}
\]
with probability at least $1-\delta$.

Proof. From the definition of $\hat\theta^{\mathrm L}$, it holds
\[
\frac1n|Y-X\hat\theta^{\mathrm L}|_2^2 + 2\tau|\hat\theta^{\mathrm L}|_1 \le \frac1n|Y-X\theta^*|_2^2 + 2\tau|\theta^*|_1 .
\]
Using Hölder's inequality, it implies
\begin{align*}
|X\hat\theta^{\mathrm L}-X\theta^*|_2^2 &\le 2\varepsilon^\top X(\hat\theta^{\mathrm L}-\theta^*) + 2n\tau\big(|\theta^*|_1-|\hat\theta^{\mathrm L}|_1\big)\\
&\le 2|X^\top\varepsilon|_\infty|\hat\theta^{\mathrm L}|_1 - 2n\tau|\hat\theta^{\mathrm L}|_1 + 2|X^\top\varepsilon|_\infty|\theta^*|_1 + 2n\tau|\theta^*|_1\\
&= 2(|X^\top\varepsilon|_\infty-n\tau)|\hat\theta^{\mathrm L}|_1 + 2(|X^\top\varepsilon|_\infty+n\tau)|\theta^*|_1 .
\end{align*}
Observe now that for any $t>0$,
\[
\mathbb{P}(|X^\top\varepsilon|_\infty\ge t) = \mathbb{P}\big(\max_{1\le j\le d}|X_j^\top\varepsilon|>t\big) \le 2d\,e^{-\frac{t^2}{2n\sigma^2}} .
\]
Therefore, taking $t=\sigma\sqrt{2n\log(2d)}+\sigma\sqrt{2n\log(1/\delta)}=n\tau$, we get that with probability at least $1-\delta$,
\[
|X\hat\theta^{\mathrm L}-X\theta^*|_2^2 \le 4n\tau|\theta^*|_1 .
\]

Notice that the regularization parameter (2.17) depends on the confidence level $\delta$. This is not the case for the BIC estimator (see (2.14)).

The rate in Theorem 2.15 is of order $\sqrt{(\log d)/n}$ (slow rate), which is much slower than the rate of order $(\log d)/n$ (fast rate) for the BIC estimator. Hereafter, we show that fast rates can also be achieved by the computationally efficient Lasso estimator, but at the cost of a much stronger condition on the design matrix $X$.

Incoherence

Assumption INC(k). We say that the design matrix $X$ has incoherence $k$ for some integer $k>0$ if
\[
\Big|\frac{X^\top X}{n}-I_d\Big|_\infty \le \frac{1}{32k} ,
\]
where $|A|_\infty$ denotes the largest element of $A$ in absolute value. Equivalently,

1. For all $j=1,\dots,d$,
\[
\Big|\frac{|X_j|_2^2}{n}-1\Big| \le \frac{1}{32k} .
\]

2. For all $1\le i,j\le d$, $i\ne j$, we have
\[
\frac{|X_i^\top X_j|}{n} \le \frac{1}{32k} .
\]

Note that Assumption ORT arises as the limiting case of INC(k) as $k\to\infty$. However, while Assumption ORT requires $d\le n$, here we may have $d\gg n$, as illustrated in Proposition 2.16 below. To that end, we simply have to show that there exists a matrix that satisfies INC(k) even for $d>n$. We resort to the probabilistic method [AS08]. The idea of this method is that if we can find a probability measure that assigns positive probability to objects that satisfy a certain property, then there must exist objects that satisfy said property.

In our case, we consider the following probability distribution on random matrices with entries in $\{\pm1\}$. Let the design matrix $X$ have entries that are i.i.d. Rademacher ($\pm1$) random variables. We are going to show that most realizations of this random matrix satisfy Assumption INC(k) for large enough $n$.

Proposition 2.16. Let $X\in\mathbb{R}^{n\times d}$ be a random matrix with entries $X_{ij}$, $i=1,\dots,n$, $j=1,\dots,d$, that are i.i.d. Rademacher ($\pm1$) random variables. Then, $X$ has incoherence $k$ with probability $1-\delta$ as soon as
\[
n \ge 2^{11}k^2\log(1/\delta) + 2^{13}k^2\log(d) .
\]
It implies that there exist matrices that satisfy Assumption INC(k) for
\[
n \ge Ck^2\log(d) ,
\]
for some numerical constant $C$.

Proof. Let $\varepsilon_{ij}\in\{-1,1\}$ denote the Rademacher random variable that is on the $i$th row and $j$th column of $X$.

Note first that the $j$th diagonal entry of $X^\top X/n$ is given by
\[
\frac1n\sum_{i=1}^n\varepsilon_{i,j}^2 = 1 .
\]
Moreover, for $j\ne k$, the $(j,k)$th entry of the $d\times d$ matrix $X^\top X/n$ is given by
\[
\frac1n\sum_{i=1}^n\varepsilon_{i,j}\varepsilon_{i,k} = \frac1n\sum_{i=1}^n\xi_i^{(j,k)} ,
\]
where for each pair $(j,k)$, $\xi_i^{(j,k)}=\varepsilon_{i,j}\varepsilon_{i,k}$, so that $\xi_1^{(j,k)},\dots,\xi_n^{(j,k)}$ are i.i.d. Rademacher random variables.

Therefore, we get that for any $t>0$,
\begin{align*}
\mathbb{P}\Big(\Big|\frac{X^\top X}{n}-I_d\Big|_\infty>t\Big)
&= \mathbb{P}\Big(\max_{j\ne k}\Big|\frac1n\sum_{i=1}^n\xi_i^{(j,k)}\Big|>t\Big)\\
&\le \sum_{j\ne k}\mathbb{P}\Big(\Big|\frac1n\sum_{i=1}^n\xi_i^{(j,k)}\Big|>t\Big) \qquad\text{(union bound)}\\
&\le \sum_{j\ne k}2e^{-\frac{nt^2}{2}} \qquad\text{(Hoeffding: Theorem 1.9)}\\
&\le 2d^2e^{-\frac{nt^2}{2}} .
\end{align*}
Taking now $t=1/(32k)$ yields
\[
\mathbb{P}\Big(\Big|\frac{X^\top X}{n}-I_d\Big|_\infty>\frac{1}{32k}\Big) \le 2d^2e^{-\frac{n}{2^{11}k^2}} \le \delta
\]
for $n\ge 2^{11}k^2\log(1/\delta)+2^{13}k^2\log(d)$.
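The following quick numerical check (not from the notes) illustrates Proposition 2.16: for a Rademacher design with $n$ large compared to $k^2\log d$, the largest entry of $X^\top X/n - I_d$ falls below the incoherence threshold $1/(32k)$. The specific values of $n$, $d$, and $k$ are assumptions chosen so that the condition of the proposition is comfortably met.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50000, 40, 1                              # n of order k^2 log(d), with a large constant
X = rng.choice([-1.0, 1.0], size=(n, d))            # i.i.d. Rademacher design
G = X.T @ X / n
off = np.abs(G - np.eye(d)).max()                   # |X^T X / n - I_d|_inf
print(f"max deviation = {off:.4f}, incoherence threshold 1/(32k) = {1/(32*k):.4f}")
```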

For any $\theta\in\mathbb{R}^d$ and $S\subset\{1,\dots,d\}$, define $\theta_S$ to be the vector with coordinates
\[
\theta_{S,j} = \begin{cases}\theta_j & \text{if } j\in S,\\ 0 & \text{otherwise}.\end{cases}
\]
In particular, $|\theta|_1=|\theta_S|_1+|\theta_{S^c}|_1$. The following lemma holds.

Lemma 2.17. Fix a positive integer $k\le d$ and assume that $X$ satisfies Assumption INC(k). Then, for any $S\subset\{1,\dots,d\}$ such that $|S|\le k$ and any $\theta\in\mathbb{R}^d$ that satisfies the cone condition
\[
|\theta_{S^c}|_1 \le 3|\theta_S|_1 , \qquad (2.18)
\]
it holds
\[
|\theta|_2^2 \le 2\,\frac{|X\theta|_2^2}{n} .
\]

Proof. We have
\[
\frac{|X\theta|_2^2}{n} = \frac{|X\theta_S|_2^2}{n} + \frac{|X\theta_{S^c}|_2^2}{n} + 2\theta_S^\top\frac{X^\top X}{n}\theta_{S^c} .
\]
We bound each of the three terms separately.

(i) First, it follows from the incoherence condition that
\[
\frac{|X\theta_S|_2^2}{n} = \theta_S^\top\frac{X^\top X}{n}\theta_S = |\theta_S|_2^2 + \theta_S^\top\Big(\frac{X^\top X}{n}-I_d\Big)\theta_S \ge |\theta_S|_2^2 - \frac{|\theta_S|_1^2}{32k} .
\]

(ii) Similarly,
\[
\frac{|X\theta_{S^c}|_2^2}{n} \ge |\theta_{S^c}|_2^2 - \frac{|\theta_{S^c}|_1^2}{32k} \ge |\theta_{S^c}|_2^2 - \frac{9|\theta_S|_1^2}{32k} ,
\]
where, in the last inequality, we used the cone condition (2.18).

(iii) Finally,
\[
2\Big|\theta_S^\top\frac{X^\top X}{n}\theta_{S^c}\Big| \le \frac{2}{32k}|\theta_S|_1|\theta_{S^c}|_1 \le \frac{6}{32k}|\theta_S|_1^2 ,
\]
where, in the last inequality, we used the cone condition (2.18) once again.

Observe now that it follows from the Cauchy-Schwarz inequality that
\[
|\theta_S|_1^2 \le |S|\,|\theta_S|_2^2 .
\]
Thus for $|S|\le k$,
\[
\frac{|X\theta|_2^2}{n} \ge |\theta_S|_2^2 + |\theta_{S^c}|_2^2 - \frac{16|S|}{32k}|\theta_S|_2^2 \ge \frac12|\theta|_2^2 .
\]

Fast rate for the Lasso

Theorem 2.18. Fix $n\ge2$. Assume that the linear model (2.2) holds where $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$. Moreover, assume that $|\theta^*|_0\le k$ and that $X$ satisfies Assumption INC(k). Then the Lasso estimator $\hat\theta^{\mathrm L}$ with regularization parameter defined by
\[
2\tau = 8\sigma\sqrt{\frac{\log(2d)}{n}} + 8\sigma\sqrt{\frac{\log(1/\delta)}{n}}
\]
satisfies
\[
\mathrm{MSE}(X\hat\theta^{\mathrm L}) = \frac1n|X\hat\theta^{\mathrm L}-X\theta^*|_2^2 \lesssim \frac{k\sigma^2\log(2d/\delta)}{n} \qquad (2.19)
\]
and
\[
|\hat\theta^{\mathrm L}-\theta^*|_2^2 \lesssim \frac{k\sigma^2\log(2d/\delta)}{n} \qquad (2.20)
\]
with probability at least $1-\delta$.

Proof. From the definition of $\hat\theta^{\mathrm L}$, it holds
\[
\frac1n|Y-X\hat\theta^{\mathrm L}|_2^2 \le \frac1n|Y-X\theta^*|_2^2 + 2\tau|\theta^*|_1 - 2\tau|\hat\theta^{\mathrm L}|_1 .
\]
Adding $\tau|\hat\theta^{\mathrm L}-\theta^*|_1$ on each side and multiplying by $n$, we get
\[
|X\hat\theta^{\mathrm L}-X\theta^*|_2^2 + n\tau|\hat\theta^{\mathrm L}-\theta^*|_1 \le 2\varepsilon^\top X(\hat\theta^{\mathrm L}-\theta^*) + n\tau|\hat\theta^{\mathrm L}-\theta^*|_1 + 2n\tau|\theta^*|_1 - 2n\tau|\hat\theta^{\mathrm L}|_1 .
\]
Applying Hölder's inequality and using the same steps as in the proof of Theorem 2.15, we get that with probability $1-\delta$,
\[
\varepsilon^\top X(\hat\theta^{\mathrm L}-\theta^*) \le |\varepsilon^\top X|_\infty|\hat\theta^{\mathrm L}-\theta^*|_1 \le \frac{n\tau}{2}|\hat\theta^{\mathrm L}-\theta^*|_1 ,
\]
where we used the fact that $|X_j|_2^2\le n+1/(32k)\le 2n$. Therefore, taking $S=\mathrm{supp}(\theta^*)$ to be the support of $\theta^*$, we get
\begin{align*}
|X\hat\theta^{\mathrm L}-X\theta^*|_2^2 + n\tau|\hat\theta^{\mathrm L}-\theta^*|_1
&\le 2n\tau|\hat\theta^{\mathrm L}-\theta^*|_1 + 2n\tau|\theta^*|_1 - 2n\tau|\hat\theta^{\mathrm L}|_1\\
&= 2n\tau|\hat\theta^{\mathrm L}_S-\theta^*|_1 + 2n\tau|\theta^*|_1 - 2n\tau|\hat\theta^{\mathrm L}_S|_1\\
&\le 4n\tau|\hat\theta^{\mathrm L}_S-\theta^*|_1 . \qquad (2.21)
\end{align*}
In particular, it implies that
\[
|\hat\theta^{\mathrm L}_{S^c}-\theta^*_{S^c}|_1 \le 3|\hat\theta^{\mathrm L}_S-\theta^*_S|_1 , \qquad (2.22)
\]
so that $\theta=\hat\theta^{\mathrm L}-\theta^*$ satisfies the cone condition (2.18). Using now the Cauchy-Schwarz inequality and Lemma 2.17 respectively, we get, since $|S|\le k$,
\[
|\hat\theta^{\mathrm L}_S-\theta^*|_1 \le \sqrt{|S|}\,|\hat\theta^{\mathrm L}_S-\theta^*|_2 \le \sqrt{|S|}\,|\hat\theta^{\mathrm L}-\theta^*|_2 \le \sqrt{\frac{2k}{n}}\,|X\hat\theta^{\mathrm L}-X\theta^*|_2 .
\]
Combining this result with (2.21), we find
\[
|X\hat\theta^{\mathrm L}-X\theta^*|_2^2 \le 32nk\tau^2 .
\]
This concludes the proof of the bound on the MSE. To prove (2.20), we use Lemma 2.17 once again to get
\[
|\hat\theta^{\mathrm L}-\theta^*|_2^2 \le 2\,\mathrm{MSE}(X\hat\theta^{\mathrm L}) \le 64k\tau^2 .
\]

Note that all we required for the proof was not really incoherence but the conclusion of Lemma 2.17:
\[
\inf_{|S|\le k}\,\inf_{\theta\in\mathcal C_S}\frac{|X\theta|_2^2}{n|\theta|_2^2} \ge \kappa , \qquad (2.23)
\]
where $\kappa=1/2$ and $\mathcal C_S$ is the cone defined by
\[
\mathcal C_S = \big\{|\theta_{S^c}|_1 \le 3|\theta_S|_1\big\} .
\]
Condition (2.23) is sometimes called the restricted eigenvalue (RE) condition. Its name comes from the following observation. Note that all $k$-sparse vectors $\theta$ are in a cone $\mathcal C_S$ with $|S|\le k$, so that the RE condition implies that the smallest eigenvalue of $X_S^\top X_S$ satisfies $\lambda_{\min}(X_S^\top X_S)\ge n\kappa$ for all $S$ such that $|S|\le k$. Clearly, the RE condition is weaker than incoherence, and it can actually be shown that a design matrix $X$ of i.i.d. Rademacher random variables satisfies the RE condition as soon as $n\ge Ck\log(d)$ with positive probability.

The Slope estimator

We noticed that the BIC estimator can actually obtain a rate of $k\log(ed/k)$, where $k=|\theta^*|_0$, while the Lasso only achieves $k\log(ed)$. This begs the question whether the same rate can be achieved by an efficiently computable estimator. The answer is yes, and the estimator achieving this is a slight modification of the Lasso, where the entries in the $\ell_1$-norm are weighted differently according to their size in order to get rid of a slight inefficiency in our bounds.

Definition 2.19 (Slope estimator). Let $\lambda=(\lambda_1,\dots,\lambda_d)$ be a non-increasing sequence of positive real numbers, $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_d>0$. For $\theta=(\theta_1,\dots,\theta_d)\in\mathbb{R}^d$, let $(\theta_1^*,\dots,\theta_d^*)$ be a non-increasing rearrangement of the moduli of the entries, $|\theta_1|,\dots,|\theta_d|$. We define the sorted $\ell_1$ norm of $\theta$ as
\[
|\theta|_* = \sum_{j=1}^d\lambda_j\theta_j^* , \qquad (2.24)
\]
or equivalently as
\[
|\theta|_* = \max_{\phi\in S_d}\sum_{j=1}^d\lambda_j|\theta_{\phi(j)}| . \qquad (2.25)
\]
The Slope estimator is then given by
\[
\hat\theta^{\mathrm S} \in \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\Big\{\frac1n|Y-X\theta|_2^2 + 2\tau|\theta|_*\Big\} \qquad (2.26)
\]
for a choice of tuning parameters $\lambda$ and $\tau>0$.

Slope stands for Sorted L-One Penalized Estimation and was introduced in [BvS+15], motivated by the quest for a penalized estimation procedure that could offer a control of the false discovery rate for the hypotheses $H_{0,j}:\theta^*_j=0$. One can check that $|\cdot|_*$ is indeed a norm and that $\hat\theta^{\mathrm S}$ can be computed efficiently, for example by proximal gradient algorithms, see [BvS+15].

In the following, we will consider
\[
\lambda_j = \sqrt{\log(2d/j)} , \qquad j=1,\dots,d . \qquad (2.27)
\]
With this choice, we will exhibit a scaling in $\tau$ that leads to the desired high-probability bounds, following the proofs in [BLT16].
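The sorted $\ell_1$ norm itself is cheap to evaluate, as the following minimal sketch (not from the notes) shows; computing the proximal operator of $|\cdot|_*$, which the Slope estimator needs, additionally requires an isotonic-regression step as described in [BvS+15] and is not shown here. The example vector is an arbitrary assumption.

```python
import numpy as np

def sorted_l1_norm(theta, lam):
    """Sorted l1 norm |theta|_* = sum_j lam_j * |theta|_(j), with lam non-increasing."""
    theta_sorted = np.sort(np.abs(theta))[::-1]          # non-increasing rearrangement
    return float(lam @ theta_sorted)

d = 10
lam = np.sqrt(np.log(2 * d / np.arange(1, d + 1)))       # the weights (2.27)
theta = np.array([0.0, 3.0, -1.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, -2.0])
print("|theta|_* =", sorted_l1_norm(theta, lam))
```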

We begin with refined bounds on the suprema of Gaussians.

Lemma 2.20. Let $g_1,\dots,g_d$ be zero-mean Gaussian variables with variance at most $\sigma^2$. Then, for any $k\le d$,
\[
\mathbb{P}\Big(\frac{1}{k\sigma^2}\sum_{j=1}^k(g_j^*)^2 > t\log\Big(\frac{2d}{k}\Big)\Big) \le \Big(\frac{2d}{k}\Big)^{1-3t/8} . \qquad (2.28)
\]

Proof. We consider the moment generating function to apply a Chernoff bound. Using Jensen's inequality on the sum, we get
\[
\mathbb{E}\exp\Big(\frac{3}{8k\sigma^2}\sum_{j=1}^k(g_j^*)^2\Big) \le \frac1k\sum_{j=1}^k\mathbb{E}\exp\Big(\frac{3(g_j^*)^2}{8\sigma^2}\Big) \le \frac1k\sum_{j=1}^d\mathbb{E}\exp\Big(\frac{3g_j^2}{8\sigma^2}\Big) \le \frac{2d}{k} ,
\]
where we used the bound on the moment generating function of the $\chi^2$ distribution from Problem 1.2 to get that $\mathbb{E}[\exp(3g^2/8)]=2$ for $g\sim\mathcal N(0,1)$. The rest follows from a Chernoff bound.

Lemma 2.21. Define $[d]:=\{1,\dots,d\}$. Under the same assumptions as in Lemma 2.20,
\[
\sup_{k\in[d]}\frac{g_k^*}{\sigma\lambda_k} \le 4\sqrt{\log(1/\delta)} , \qquad (2.29)
\]
with probability at least $1-\delta$, for $\delta<\frac12$.

Proof. We can estimate $(g_k^*)^2$ by the average of all larger variables,
\[
(g_k^*)^2 \le \frac1k\sum_{j=1}^k(g_j^*)^2 . \qquad (2.30)
\]
Applying Lemma 2.20 yields
\[
\mathbb{P}\Big(\frac{(g_k^*)^2}{\sigma^2\lambda_k^2} > t\Big) \le \Big(\frac{2d}{k}\Big)^{1-3t/8} . \qquad (2.31)
\]
For $t>8$,
\begin{align*}
\mathbb{P}\Big(\sup_{k\in[d]}\frac{(g_k^*)^2}{\sigma^2\lambda_k^2} > t\Big) &\le \sum_{j=1}^d\Big(\frac{2d}{j}\Big)^{1-3t/8} & (2.32)\\
&\le (2d)^{1-3t/8}\sum_{j=1}^d\frac1{j^2} & (2.33)\\
&\le 4\cdot 2^{-3t/8} . & (2.34)
\end{align*}
In turn, this means that
\[
\sup_{k\in[d]}\frac{g_k^*}{\sigma\lambda_k} \le 4\sqrt{\log(1/\delta)} \qquad (2.35)
\]
with probability at least $1-\delta$, for $\delta\le1/2$.
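A quick Monte Carlo check (not from the notes) of (2.29): sample the Gaussians, sort their absolute values, and compare the empirical quantile of $\sup_k g_k^*/(\sigma\lambda_k)$ to the bound $4\sqrt{\log(1/\delta)}$; the bound is conservative, so the empirical quantile should sit well below it. The values of $d$, $\delta$, and the number of repetitions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_rep, delta = 1000, 1.0, 2000, 0.1
lam = np.sqrt(np.log(2 * d / np.arange(1, d + 1)))        # lambda_j from (2.27)

sup_vals = np.empty(n_rep)
for r in range(n_rep):
    g_sorted = np.sort(np.abs(sigma * rng.standard_normal(d)))[::-1]   # g_1^* >= ... >= g_d^*
    sup_vals[r] = np.max(g_sorted / (sigma * lam))

print("empirical (1-delta)-quantile:", np.quantile(sup_vals, 1 - delta))
print("bound 4*sqrt(log(1/delta)) :", 4 * np.sqrt(np.log(1 / delta)))
```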


Theorem 2.22. Fix $n\ge2$. Assume that the linear model (2.2) holds where $\varepsilon\sim\mathcal N_n(0,\sigma^2I_n)$. Moreover, assume that $|\theta^*|_0\le k$ and that $X$ satisfies Assumption INC($k'$) with $k'\ge4k\log(2de/k)$. Then the Slope estimator $\hat\theta^{\mathrm S}$ with regularization parameter defined by
\[
\tau = 8\sqrt2\,\sigma\sqrt{\frac{\log(1/\delta)}{n}} \qquad (2.36)
\]
satisfies
\[
\mathrm{MSE}(X\hat\theta^{\mathrm S}) = \frac1n|X\hat\theta^{\mathrm S}-X\theta^*|_2^2 \lesssim \sigma^2\,\frac{k\log(2d/k)\log(1/\delta)}{n} \qquad (2.37)
\]
and
\[
|\hat\theta^{\mathrm S}-\theta^*|_2^2 \lesssim \sigma^2\,\frac{k\log(2d/k)\log(1/\delta)}{n} \qquad (2.38)
\]
with probability at least $1-\delta$.

Remark 2.23. The multiplicative dependence on $\log(1/\delta)$ in (2.37) is an artifact of the simplified proof we give here and can be improved to an additive one, similar to the Lasso case (2.19), by appealing to stronger and more general concentration inequalities. For more details, see [BLT16].

Proof. By minimality of $\hat\theta^{\mathrm S}$ in (2.26), adding $n\tau|\hat\theta^{\mathrm S}-\theta^*|_*$ on both sides,
\[
|X\hat\theta^{\mathrm S}-X\theta^*|_2^2 + n\tau|\hat\theta^{\mathrm S}-\theta^*|_* \le 2\varepsilon^\top X(\hat\theta^{\mathrm S}-\theta^*) + n\tau|\hat\theta^{\mathrm S}-\theta^*|_* + 2\tau n|\theta^*|_* - 2\tau n|\hat\theta^{\mathrm S}|_* . \qquad (2.39)
\]
Set
\[
u := \hat\theta^{\mathrm S}-\theta^* , \qquad g_j = (X^\top\varepsilon)_j . \qquad (2.40)
\]
By Lemma 2.21, we can estimate
\begin{align*}
\varepsilon^\top Xu = \sum_{j=1}^d(X^\top\varepsilon)_ju_j &\le \sum_{j=1}^d g_j^*u_j^* & (2.41)\\
&= \sum_{j=1}^d\frac{g_j^*}{\lambda_j}(\lambda_ju_j^*) & (2.42)\\
&\le \sup_j\Big\{\frac{g_j^*}{\lambda_j}\Big\}\,|u|_* & (2.43)\\
&\le 4\sqrt2\,\sigma\sqrt{n\log(1/\delta)}\,|u|_* = \frac{n\tau}{2}|u|_* , & (2.44)
\end{align*}
where we used that $|X_j|_2^2\le2n$.

Pick a permutation $\phi$ such that $|\theta^*|_* = \sum_{j=1}^k\lambda_j|\theta^*_{\phi(j)}|$ and $|u_{\phi(k+1)}|\ge\dots\ge|u_{\phi(d)}|$. Then, noting that $\lambda_j$ is monotonically decreasing,
\begin{align*}
|\theta^*|_* - |\hat\theta^{\mathrm S}|_* &= \sum_{j=1}^k\lambda_j|\theta^*_{\phi(j)}| - \sum_{j=1}^d\lambda_j(\hat\theta^{\mathrm S})^*_j & (2.45)\\
&\le \sum_{j=1}^k\lambda_j\big(|\theta^*_{\phi(j)}|-|\hat\theta^{\mathrm S}_{\phi(j)}|\big) - \sum_{j=k+1}^d\lambda_j|\hat\theta^{\mathrm S}_{\phi(j)}| & (2.46)\\
&\le \sum_{j=1}^k\lambda_j|u_{\phi(j)}| - \sum_{j=k+1}^d\lambda_ju_j^* & (2.47)\\
&\le \sum_{j=1}^k\lambda_ju_j^* - \sum_{j=k+1}^d\lambda_ju_j^* . & (2.48)
\end{align*}
Combined with $\tau=8\sqrt2\,\sigma\sqrt{\log(1/\delta)/n}$ and the basic inequality (2.39), we have that
\begin{align*}
|X\hat\theta^{\mathrm S}-X\theta^*|_2^2 + n\tau|u|_* &\le 2n\tau|u|_* + 2\tau n|\theta^*|_* - 2\tau n|\hat\theta^{\mathrm S}|_* & (2.49)\\
&\le 4n\tau\sum_{j=1}^k\lambda_ju_j^* , & (2.50)
\end{align*}
whence
\[
\sum_{j=k+1}^d\lambda_ju_j^* \le 3\sum_{j=1}^k\lambda_ju_j^* . \qquad (2.51)
\]
We can now repeat the incoherence arguments from Lemma 2.17, with $S$ being the set of the $k$ largest entries of $|u|$, to get the same conclusion under the restriction INC($k'$). First, by exactly the same argument as in Lemma 2.17, we have
\[
\frac{|Xu_S|_2^2}{n} \ge |u_S|_2^2 - \frac{|u_S|_1^2}{32k'} \ge |u_S|_2^2 - \frac{k}{32k'}|u_S|_2^2 . \qquad (2.52)
\]
Next, for the cross terms, we have
\begin{align*}
2\Big|u_S^\top\frac{X^\top X}{n}u_{S^c}\Big| &\le \frac{2}{32k'}\sum_{i=1}^ku_i^*\sum_{j=k+1}^du_j^* & (2.53)\\
&= \frac{2}{32k'}\sum_{i=1}^ku_i^*\sum_{j=k+1}^d\frac{\lambda_j}{\lambda_j}u_j^* & (2.54)\\
&\le \frac{2}{32k'\lambda_d}\sum_{i=1}^ku_i^*\sum_{j=k+1}^d\lambda_ju_j^* & (2.55)\\
&\le \frac{6}{32k'\lambda_d}\sum_{i=1}^ku_i^*\sum_{j=1}^k\lambda_ju_j^* & (2.56)\\
&\le \frac{6\sqrt k}{32k'\lambda_d}\Big(\sum_{j=1}^k\lambda_j^2\Big)^{1/2}\sum_{i=1}^k(u_i^*)^2 & (2.57)\\
&\le \frac{6k}{32k'\lambda_d}\sqrt{\log(2de/k)}\,|u_S|_2^2 , & (2.58)
\end{align*}
where we estimated the sum by
\begin{align*}
\sum_{j=1}^k\log\Big(\frac{2d}{j}\Big) &= k\log(2d) - \log(k!) \le k\log(2d) - k\log(k/e) & (2.59)\\
&= k\log(2de/k) , & (2.60)
\end{align*}
using Stirling's inequality, $k!\ge(k/e)^k$.

Finally, from a similar calculation, using the cone condition twice,
\[
\frac{|Xu_{S^c}|_2^2}{n} \ge |u_{S^c}|_2^2 - \frac{9k}{32k'\lambda_d^2}\log(2de/k)\,|u_S|_2^2 . \qquad (2.61)
\]
Concluding, if INC($k'$) holds with $k'\ge4k\log(2de/k)$, taking into account that $\lambda_d\ge1/2$, we have
\[
\frac{|Xu|_2^2}{n} \ge |u_S|_2^2 + |u_{S^c}|_2^2 - \frac{36+12+1}{128}|u_S|_2^2 \ge \frac12|u|_2^2 . \qquad (2.62)
\]
Hence, from (2.50),
\begin{align*}
|Xu|_2^2 + n\tau|u|_* &\le 4n\tau\Big(\sum_{j=1}^k\lambda_j^2\Big)^{1/2}\Big(\sum_{j=1}^k(u_j^*)^2\Big)^{1/2} & (2.63)\\
&\le 4\sqrt2\,\tau\sqrt{nk\log(2de/k)}\,|Xu|_2 & (2.64)\\
&= 2^6\sigma\sqrt{\log(1/\delta)}\sqrt{k\log(2de/k)}\,|Xu|_2 , & (2.65)
\end{align*}
which combined yields
\[
\frac1n|Xu|_2^2 \le 2^{12}\sigma^2\,\frac kn\log(2de/k)\log(1/\delta) , \qquad (2.66)
\]
and
\[
|u|_2^2 \le 2^{13}\,\frac kn\,\sigma^2\log(2de/k)\log(1/\delta) . \qquad (2.67)
\]


2.5 PROBLEM SET

Problem 2.1. Consider the linear regression model with fixed design with $d\le n$. The ridge regression estimator is employed when $\mathrm{rank}(X^\top X)<d$ but we are interested in estimating $\theta^*$. It is defined for a given parameter $\tau>0$ by
\[
\hat\theta^{\mathrm{ridge}}_\tau = \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\Big\{\frac1n|Y-X\theta|_2^2 + \tau|\theta|_2^2\Big\} .
\]
(a) Show that for any $\tau$, $\hat\theta^{\mathrm{ridge}}_\tau$ is uniquely defined and give its closed-form expression.

(b) Compute the bias of $\hat\theta^{\mathrm{ridge}}_\tau$ and show that it is bounded in absolute value by $|\theta^*|_2$.

Problem 2.2. Let $X=(1,Z,\dots,Z^{d-1})^\top\in\mathbb{R}^d$ be a random vector where $Z$ is a random variable. Show that the matrix $\mathbb{E}(XX^\top)$ is positive definite if $Z$ admits a probability density with respect to the Lebesgue measure on $\mathbb{R}$.

Problem 2.3. Let $\hat\theta^{\mathrm{hrd}}$ be the hard thresholding estimator defined in Definition 2.10.

1. Show that
\[
|\hat\theta^{\mathrm{hrd}}|_0 \le |\theta^*|_0 + \frac{|\hat\theta^{\mathrm{hrd}}-\theta^*|_2^2}{4\tau^2} .
\]

2. Conclude that if $\tau$ is chosen as in Theorem 2.11, then
\[
|\hat\theta^{\mathrm{hrd}}|_0 \le C|\theta^*|_0
\]
with probability $1-\delta$ for some constant $C$ to be made explicit.

Problem 2.4. In the proof of Theorem 2.11, show that $4\min(|\theta^*_j|,\tau)$ can be replaced by $3\min(|\theta^*_j|,\tau)$, i.e., that on the event $\mathcal A$, it holds
\[
|\hat\theta^{\mathrm{hrd}}_j-\theta^*_j| \le 3\min(|\theta^*_j|,\tau) .
\]

Problem 2.5. For any $q>0$, a vector $\theta\in\mathbb{R}^d$ is said to be in a weak $\ell_q$ ball of radius $R$ if the decreasing rearrangement $|\theta_{[1]}|\ge|\theta_{[2]}|\ge\dots$ satisfies
\[
|\theta_{[j]}| \le Rj^{-1/q} .
\]
Moreover, we define the weak $\ell_q$ norm of $\theta$ by
\[
|\theta|_{w\ell_q} = \max_{1\le j\le d}j^{1/q}|\theta_{[j]}| .
\]
(a) Give examples of $\theta,\theta'\in\mathbb{R}^d$ such that
\[
|\theta+\theta'|_{w\ell_1} > |\theta|_{w\ell_1} + |\theta'|_{w\ell_1} .
\]
What do you conclude?


(b) Show that $|\theta|_{w\ell_q} \le |\theta|_q$.

(c) Given a sequence $\theta_1,\theta_2,\dots$, show that if $\lim_{d\to\infty}|\theta_{\{1,\dots,d\}}|_{w\ell_q}<\infty$, then $\lim_{d\to\infty}|\theta_{\{1,\dots,d\}}|_{q'}<\infty$ for all $q'>q$.

(d) Show that, for any $q\in(0,2)$, if $\lim_{d\to\infty}|\theta_{\{1,\dots,d\}}|_{w\ell_q}=C$, there exists a constant $C_q>0$ that depends on $q$ but not on $d$ such that, under the assumptions of Theorem 2.11, it holds
\[
|\hat\theta^{\mathrm{hrd}}-\theta^*|_2^2 \le C_q\Big(\frac{\sigma^2\log 2d}{n}\Big)^{1-\frac q2}
\]
with probability .99.

Problem 2.6. Show that
\[
\hat\theta^{\mathrm{hrd}} = \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\big\{|y-\theta|_2^2 + 4\tau^2|\theta|_0\big\} , \qquad
\hat\theta^{\mathrm{sft}} = \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\big\{|y-\theta|_2^2 + 4\tau|\theta|_1\big\} .
\]

Problem 2.7. Assume the linear model (2.2) with $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$ and $\theta^*\ne0$. Show that the modified BIC estimator $\hat\theta$ defined by
\[
\hat\theta \in \operatorname*{argmin}_{\theta\in\mathbb{R}^d}\Big\{\frac1n|Y-X\theta|_2^2 + \lambda|\theta|_0\log\Big(\frac{ed}{|\theta|_0}\Big)\Big\}
\]
satisfies
\[
\mathrm{MSE}(X\hat\theta) \lesssim |\theta^*|_0\sigma^2\,\frac{\log\big(\frac{ed}{|\theta^*|_0}\big)}{n}
\]
with probability .99, for an appropriately chosen $\lambda$. What do you conclude?

Problem 2.8. Assume that the linear model (2.2) holds where $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$. Moreover, assume the conditions of Theorem 2.2 and that the columns of $X$ are normalized in such a way that $\max_j|X_j|_2\le\sqrt n$. Then the Lasso estimator $\hat\theta^{\mathrm L}$ with regularization parameter
\[
2\tau = 8\sigma\sqrt{\frac{2\log(2d)}{n}}
\]
satisfies
\[
|\hat\theta^{\mathrm L}|_1 \le C|\theta^*|_1
\]
with probability $1-(2d)^{-1}$ for some constant $C$ to be specified.


Chapter 4? No:

Chapter 3

Misspecified Linear Models

Arguably, the strongest assumption that we made in Chapter 2 is that the regression function $f(x)$ is of the form $f(x)=x^\top\theta^*$. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be robust to deviations from this model. This is the problem of model misspecification, and in particular misspecified linear models.

Throughout this chapter, we assume the following model:
\[
Y_i = f(X_i) + \varepsilon_i , \qquad i=1,\dots,n , \qquad (3.1)
\]
where $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top$ is sub-Gaussian with variance proxy $\sigma^2$. Here $X_i\in\mathbb{R}^d$. When dealing with fixed design, it will be convenient to consider the vector $g\in\mathbb{R}^n$ defined for any function $g:\mathbb{R}^d\to\mathbb{R}$ by $g=(g(X_1),\dots,g(X_n))^\top$. In this case, we can write for any estimator $\hat f\in\mathbb{R}^n$ of $f$,
\[
\mathrm{MSE}(\hat f) = \frac1n|\hat f-f|_2^2 .
\]
Even though the model may not be linear, we are interested in studying the statistical properties of the various linear estimators introduced in the previous chapters: $\hat\theta^{\mathrm{ls}}$, $\hat\theta^{\mathrm{ls}}_K$, $\hat\theta^{\mathrm{ls}}_X$, $\hat\theta^{\mathrm{bic}}$, $\hat\theta^{\mathrm L}$. Clearly, even with an infinite number of observations, we have no chance of finding a consistent estimator of $f$ if we don't know the correct model. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.


3.1 ORACLE INEQUALITIES

Oracle inequalities

As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models.

When employing the least squares estimator $\hat\theta^{\mathrm{ls}}$, we constrain ourselves to estimating functions that are of the form $x\mapsto x^\top\theta$, even though $f$ itself may not be of this form. Therefore, the oracle is the linear function that is the closest to $f$.

Rather than trying to approximate $f$ by a linear function $f(x)\approx\theta^\top x$, we make the model a bit more general and consider a dictionary $\mathcal H=\{\varphi_1,\dots,\varphi_M\}$ of functions where $\varphi_j:\mathbb{R}^d\to\mathbb{R}$. In this case, we can actually remove the assumption that $X\in\mathbb{R}^d$. Indeed, the goal is now to estimate $f$ using a linear combination of the functions in the dictionary:
\[
f \approx \varphi_\theta := \sum_{j=1}^M\theta_j\varphi_j .
\]
Remark 3.1. If $M=d$ and $\varphi_j(X)=X^{(j)}$ returns the $j$th coordinate of $X\in\mathbb{R}^d$, then the goal is to approximate $f(x)$ by $\theta^\top x$. Nevertheless, the use of a dictionary allows for a much more general framework.

Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define

1. The least squares estimator:
\[
\hat\theta^{\mathrm{ls}} \in \operatorname*{argmin}_{\theta\in\mathbb{R}^M}\frac1n\sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 \qquad (3.2)
\]

2. The least squares estimator constrained to $K\subset\mathbb{R}^M$:
\[
\hat\theta^{\mathrm{ls}}_K \in \operatorname*{argmin}_{\theta\in K}\frac1n\sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2
\]

3. The BIC estimator:
\[
\hat\theta^{\mathrm{bic}} \in \operatorname*{argmin}_{\theta\in\mathbb{R}^M}\Big\{\frac1n\sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 + \tau^2|\theta|_0\Big\} \qquad (3.3)
\]

4. The Lasso estimator:
\[
\hat\theta^{\mathrm L} \in \operatorname*{argmin}_{\theta\in\mathbb{R}^M}\Big\{\frac1n\sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 + 2\tau|\theta|_1\Big\} \qquad (3.4)
\]

Definition 3.2. Let $R(\cdot)$ be a risk function and let $\mathcal H=\{\varphi_1,\dots,\varphi_M\}$ be a dictionary of functions from $\mathbb{R}^d$ to $\mathbb{R}$. Let $K$ be a subset of $\mathbb{R}^M$. The oracle on $K$ with respect to $R$ is defined by $\varphi_{\bar\theta}$, where $\bar\theta\in K$ is such that
\[
R(\varphi_{\bar\theta}) \le R(\varphi_\theta) , \qquad \forall\,\theta\in K .
\]
Moreover, $R_K=R(\varphi_{\bar\theta})$ is called the oracle risk on $K$. An estimator $\hat f$ is said to satisfy an oracle inequality (over $K$) with remainder term $\phi$ in expectation (resp. with high probability) if there exists a constant $C\ge1$ such that
\[
\mathbb{E}R(\hat f) \le C\inf_{\theta\in K}R(\varphi_\theta) + \phi_{n,M}(K) ,
\]
or
\[
\mathbb{P}\Big\{R(\hat f) \le C\inf_{\theta\in K}R(\varphi_\theta) + \phi_{n,M,\delta}(K)\Big\} \ge 1-\delta , \qquad \forall\,\delta>0 ,
\]
respectively. If $C=1$, the oracle inequality is sometimes called exact.

Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.

Oracle inequality for the least squares estimator

While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of the extension to this case for linear models is essentially already captured by the analysis for the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.

Theorem 3.3. Assume the general regression model (3.1) with $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$. Then, the least squares estimator $\hat\theta^{\mathrm{ls}}$ satisfies for some numerical constant $C>0$,
\[
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{ls}}}) \le \inf_{\theta\in\mathbb{R}^M}\mathrm{MSE}(\varphi_\theta) + C\sigma^2\,\frac Mn\log(1/\delta)
\]
with probability at least $1-\delta$.

Proof. Note that by definition
\[
|Y-\varphi_{\hat\theta^{\mathrm{ls}}}|_2^2 \le |Y-\varphi_{\bar\theta}|_2^2 ,
\]
where $\varphi_{\bar\theta}$ denotes the orthogonal projection of $f$ onto the linear span of $\varphi_1,\dots,\varphi_M$. Since $Y=f+\varepsilon$, we get
\[
|f-\varphi_{\hat\theta^{\mathrm{ls}}}|_2^2 \le |f-\varphi_{\bar\theta}|_2^2 + 2\varepsilon^\top(\varphi_{\hat\theta^{\mathrm{ls}}}-\varphi_{\bar\theta}) .
\]
Moreover, by Pythagoras's theorem, we have
\[
|f-\varphi_{\hat\theta^{\mathrm{ls}}}|_2^2 - |f-\varphi_{\bar\theta}|_2^2 = |\varphi_{\hat\theta^{\mathrm{ls}}}-\varphi_{\bar\theta}|_2^2 .
\]
It yields
\[
|\varphi_{\hat\theta^{\mathrm{ls}}}-\varphi_{\bar\theta}|_2^2 \le 2\varepsilon^\top(\varphi_{\hat\theta^{\mathrm{ls}}}-\varphi_{\bar\theta}) .
\]
Using the same steps as the ones following equation (2.5) for the well-specified case, we get
\[
|\varphi_{\hat\theta^{\mathrm{ls}}}-\varphi_{\bar\theta}|_2^2 \lesssim \sigma^2M\log(1/\delta)
\]
with probability $1-\delta$. Dividing by $n$ and using the Pythagorean identity above, the result of the theorem follows.

Sparse oracle inequality for the BIC estimator

The analysis for more complicated estimators, such as the BIC in Chapter 2, allows us to derive oracle inequalities for these estimators.

Theorem 3.4. Assume the general regression model (3.1) with $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$. Then, the BIC estimator $\hat\theta^{\mathrm{bic}}$ with regularization parameter
\[
\tau^2 = 16\log(6)\,\frac{\sigma^2}{n} + 32\,\frac{\sigma^2\log(eM)}{n} \qquad (3.5)
\]
satisfies for some numerical constant $C>0$,
\[
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{bic}}}) \le \inf_{\theta\in\mathbb{R}^M}\Big\{3\,\mathrm{MSE}(\varphi_\theta) + \frac{C\sigma^2}{n}|\theta|_0\log(eM/\delta)\Big\} + \frac{C\sigma^2}{n}\log(1/\delta)
\]
with probability at least $1-\delta$.

Proof. Recall that the proof of Theorem 2.14 for the BIC estimator begins as follows:
\[
\frac1n|Y-\varphi_{\hat\theta^{\mathrm{bic}}}|_2^2 + \tau^2|\hat\theta^{\mathrm{bic}}|_0 \le \frac1n|Y-\varphi_\theta|_2^2 + \tau^2|\theta|_0 .
\]
This is true for any $\theta\in\mathbb{R}^M$. It implies
\[
|f-\varphi_{\hat\theta^{\mathrm{bic}}}|_2^2 + n\tau^2|\hat\theta^{\mathrm{bic}}|_0 \le |f-\varphi_\theta|_2^2 + 2\varepsilon^\top(\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta) + n\tau^2|\theta|_0 .
\]
Note that if $\hat\theta^{\mathrm{bic}}=\theta$, the result is trivial. Otherwise,
\begin{align*}
2\varepsilon^\top(\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta) &= 2\varepsilon^\top\Big(\frac{\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta}{|\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta|_2}\Big)|\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta|_2\\
&\le \frac2\alpha\Big[\varepsilon^\top\Big(\frac{\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta}{|\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta|_2}\Big)\Big]^2 + \frac\alpha2|\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta|_2^2 ,
\end{align*}
where we used Young's inequality $2ab\le\frac2\alpha a^2+\frac\alpha2 b^2$, which is valid for $a,b\in\mathbb{R}$ and $\alpha>0$. Next, since
\[
\frac\alpha2|\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta|_2^2 \le \alpha|\varphi_{\hat\theta^{\mathrm{bic}}}-f|_2^2 + \alpha|\varphi_\theta-f|_2^2 ,
\]
we get for $\alpha=1/2$,
\begin{align*}
\frac12|\varphi_{\hat\theta^{\mathrm{bic}}}-f|_2^2 &\le \frac32|\varphi_\theta-f|_2^2 + n\tau^2|\theta|_0 + 4\big[\varepsilon^\top U(\varphi_{\hat\theta^{\mathrm{bic}}}-\varphi_\theta)\big]^2 - n\tau^2|\hat\theta^{\mathrm{bic}}|_0\\
&\le \frac32|\varphi_\theta-f|_2^2 + 2n\tau^2|\theta|_0 + 4\big[\varepsilon^\top U(\varphi_{\hat\theta^{\mathrm{bic}}-\theta})\big]^2 - n\tau^2|\hat\theta^{\mathrm{bic}}-\theta|_0 .
\end{align*}
We conclude as in the proof of Theorem 2.14.

The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error $\mathrm{MSE}(\varphi_\theta)$ and the complexity of $\theta$ as measured by its sparsity. In particular, this result, which is sometimes called a sparse oracle inequality, implies the following oracle inequality. Define the oracle $\bar\theta$ to be such that
\[
\mathrm{MSE}(\varphi_{\bar\theta}) = \min_{\theta\in\mathbb{R}^M}\mathrm{MSE}(\varphi_\theta) .
\]
Then, with probability at least $1-\delta$,
\[
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{bic}}}) \le 3\,\mathrm{MSE}(\varphi_{\bar\theta}) + \frac{C\sigma^2}{n}\big[|\bar\theta|_0\log(eM)+\log(1/\delta)\big] .
\]
If the linear model happens to be correct, then we simply have $\mathrm{MSE}(\varphi_{\bar\theta})=0$.

Sparse oracle inequality for the Lasso

To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the $n\times M$ matrix $\Phi$ with elements $\Phi_{i,j}=\varphi_j(X_i)$.

Theorem 3.5. Assume the general regression model (3.1) with $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$. Moreover, assume that there exists an integer $k$ such that the matrix $\Phi$ satisfies Assumption INC(k). Then, the Lasso estimator $\hat\theta^{\mathrm L}$ with regularization parameter given by
\[
2\tau = 8\sigma\sqrt{\frac{2\log(2M)}{n}} + 8\sigma\sqrt{\frac{2\log(1/\delta)}{n}} \qquad (3.6)
\]
satisfies for some numerical constant $C$,
\[
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm L}}) \le \inf_{\substack{\theta\in\mathbb{R}^M\\ |\theta|_0\le k}}\Big\{\mathrm{MSE}(\varphi_\theta) + \frac{C\sigma^2}{n}|\theta|_0\log(eM/\delta)\Big\}
\]
with probability at least $1-\delta$.

Proof. From the definition of $\hat\theta^{\mathrm L}$, for any $\theta\in\mathbb{R}^M$,
\[
\frac1n|Y-\varphi_{\hat\theta^{\mathrm L}}|_2^2 \le \frac1n|Y-\varphi_\theta|_2^2 + 2\tau|\theta|_1 - 2\tau|\hat\theta^{\mathrm L}|_1 .
\]
Expanding the squares, adding $\tau|\hat\theta^{\mathrm L}-\theta|_1$ on each side and multiplying by $n$, we get
\[
|\varphi_{\hat\theta^{\mathrm L}}-f|_2^2 - |\varphi_\theta-f|_2^2 + n\tau|\hat\theta^{\mathrm L}-\theta|_1 \le 2\varepsilon^\top(\varphi_{\hat\theta^{\mathrm L}}-\varphi_\theta) + n\tau|\hat\theta^{\mathrm L}-\theta|_1 + 2n\tau|\theta|_1 - 2n\tau|\hat\theta^{\mathrm L}|_1 . \qquad (3.7)
\]
Next, note that INC(k) for any $k\ge1$ implies that $|\varphi_j|_2\le2\sqrt n$ for all $j=1,\dots,M$. Applying Hölder's inequality, we get that with probability $1-\delta$, it holds that
\[
2\varepsilon^\top(\varphi_{\hat\theta^{\mathrm L}}-\varphi_\theta) \le \frac{n\tau}{2}|\hat\theta^{\mathrm L}-\theta|_1 .
\]
Therefore, taking $S=\mathrm{supp}(\theta)$ to be the support of $\theta$, we get that the right-hand side of (3.7) is bounded by
\begin{align*}
|\varphi_{\hat\theta^{\mathrm L}}-f|_2^2 - |\varphi_\theta-f|_2^2 + n\tau|\hat\theta^{\mathrm L}-\theta|_1
&\le 2n\tau|\hat\theta^{\mathrm L}-\theta|_1 + 2n\tau|\theta|_1 - 2n\tau|\hat\theta^{\mathrm L}|_1\\
&= 2n\tau|\hat\theta^{\mathrm L}_S-\theta|_1 + 2n\tau|\theta|_1 - 2n\tau|\hat\theta^{\mathrm L}_S|_1\\
&\le 4n\tau|\hat\theta^{\mathrm L}_S-\theta|_1 \qquad (3.8)
\end{align*}
with probability $1-\delta$. It implies that either $\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm L}})\le\mathrm{MSE}(\varphi_\theta)$ or that
\[
|\hat\theta^{\mathrm L}_{S^c}-\theta_{S^c}|_1 \le 3|\hat\theta^{\mathrm L}_S-\theta_S|_1 ,
\]
so that $\hat\theta^{\mathrm L}-\theta$ satisfies the cone condition (2.18). Using now the Cauchy-Schwarz inequality and Lemma 2.17, respectively, assuming that $|\theta|_0\le k$, we get
\[
4n\tau|\hat\theta^{\mathrm L}_S-\theta|_1 \le 4n\tau\sqrt{|S|}\,|\hat\theta^{\mathrm L}_S-\theta|_2 \le 4\tau\sqrt{2n|\theta|_0}\,|\varphi_{\hat\theta^{\mathrm L}}-\varphi_\theta|_2 .
\]
Using now the inequality $2ab\le\frac2\alpha a^2+\frac\alpha2 b^2$, we get
\begin{align*}
4n\tau|\hat\theta^{\mathrm L}_S-\theta|_1 &\le \frac{16\tau^2n|\theta|_0}{\alpha} + \frac\alpha2|\varphi_{\hat\theta^{\mathrm L}}-\varphi_\theta|_2^2\\
&\le \frac{16\tau^2n|\theta|_0}{\alpha} + \alpha|\varphi_{\hat\theta^{\mathrm L}}-f|_2^2 + \alpha|\varphi_\theta-f|_2^2 .
\end{align*}
Combining this result with (3.7) and (3.8), we find
\[
(1-\alpha)\,\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm L}}) \le (1+\alpha)\,\mathrm{MSE}(\varphi_\theta) + \frac{16\tau^2|\theta|_0}{\alpha} .
\]
To conclude the proof, it only remains to divide by $1-\alpha$ on both sides of the above inequality and take $\alpha=1/2$.

Maurey’s argument

If there is no sparse $\theta$ such that $\mathrm{MSE}(\varphi_\theta)$ is small, Theorem 3.4 is useless, whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors $\theta$ with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case for example if $\theta$ has a small $\ell_1$ norm but is not sparse. For such $\theta$, the Lasso estimator still enjoys slow rates as in Theorem 2.15, which can be easily extended to the misspecified case (see Problem 3.2). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds, since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector $\theta\in\mathbb{R}^M$ such that $|\theta|_1\le1$, there exists a vector $\theta'$ that is sparse and for which $\mathrm{MSE}(\varphi_{\theta'})$ is not much larger than $\mathrm{MSE}(\varphi_\theta)$. The following theorem quantifies exactly the tradeoff between sparsity and MSE. It is often attributed to B. Maurey and was published by Pisier [Pis81]. This is why it is referred to as Maurey's argument.

Theorem 3.6. Let $\{\varphi_1,\dots,\varphi_M\}$ be a dictionary normalized in such a way that
\[
\max_{1\le j\le M}|\varphi_j|_2 \le D\sqrt n .
\]
Then for any integer $k$ such that $1\le k\le M$ and any positive $R$, we have
\[
\min_{\substack{\theta\in\mathbb{R}^M\\ |\theta|_0\le k}}\mathrm{MSE}(\varphi_\theta) \le \min_{\substack{\theta\in\mathbb{R}^M\\ |\theta|_1\le R}}\mathrm{MSE}(\varphi_\theta) + \frac{D^2R^2}{k} .
\]

Proof. Define
\[
\bar\theta \in \operatorname*{argmin}_{\substack{\theta\in\mathbb{R}^M\\ |\theta|_1\le R}}|\varphi_\theta-f|_2^2
\]
and assume without loss of generality that $|\bar\theta_1|\ge|\bar\theta_2|\ge\dots\ge|\bar\theta_M|$. Let now $U\in\mathbb{R}^n$ be a random vector with values in $\{0,\pm R\varphi_1,\dots,\pm R\varphi_M\}$ defined by
\[
\mathbb{P}\big(U=R\,\mathrm{sign}(\bar\theta_j)\varphi_j\big) = \frac{|\bar\theta_j|}{R} , \quad j=1,\dots,M , \qquad \mathbb{P}(U=0) = 1-\frac{|\bar\theta|_1}{R} .
\]
Note that $\mathbb{E}[U]=\varphi_{\bar\theta}$ and $|U|_2\le RD\sqrt n$. Let now $U_1,\dots,U_k$ be $k$ independent copies of $U$ and define their average
\[
\bar U = \frac1k\sum_{i=1}^kU_i .
\]
Note that $\bar U=\varphi_{\tilde\theta}$ for some $\tilde\theta\in\mathbb{R}^M$ such that $|\tilde\theta|_0\le k$. Moreover, using the Pythagorean theorem,
\begin{align*}
\mathbb{E}|f-\bar U|_2^2 &= \mathbb{E}|f-\varphi_{\bar\theta}+\varphi_{\bar\theta}-\bar U|_2^2\\
&= |f-\varphi_{\bar\theta}|_2^2 + \mathbb{E}|\varphi_{\bar\theta}-\bar U|_2^2\\
&= |f-\varphi_{\bar\theta}|_2^2 + \frac{\mathbb{E}|U-\mathbb{E}[U]|_2^2}{k}\\
&\le |f-\varphi_{\bar\theta}|_2^2 + \frac{(RD\sqrt n)^2}{k} .
\end{align*}
To conclude the proof, note that
\[
\mathbb{E}|f-\bar U|_2^2 = \mathbb{E}|f-\varphi_{\tilde\theta}|_2^2 \ge \min_{\substack{\theta\in\mathbb{R}^M\\ |\theta|_0\le k}}|f-\varphi_\theta|_2^2
\]
and divide by $n$.
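The random sparsification at the heart of Maurey's argument is easy to simulate. The following sketch (not from the notes) draws $k$ random sign-weighted atoms with probabilities $|\bar\theta_j|/R$, averages them, and compares the resulting MSE to the bound $D^2R^2/k$; here the dictionary, the target $f=\varphi_{\bar\theta}$, and the choice $R=D=1$ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M, k = 200, 400, 10
Phi = rng.standard_normal((n, M))
Phi /= np.linalg.norm(Phi, axis=0) / np.sqrt(n)          # normalize so |phi_j|_2 = sqrt(n), i.e. D = 1
theta_bar = rng.laplace(size=M)
theta_bar /= np.abs(theta_bar).sum()                     # |theta_bar|_1 = R = 1
f = Phi @ theta_bar                                      # target (here f is exactly phi_{theta_bar})

probs = np.abs(theta_bar)                                # P(U = sign(theta_j) * phi_j) = |theta_j| / R
errs = []
for _ in range(200):
    J = rng.choice(M, size=k, p=probs)                   # k independent copies of U
    U_bar = np.mean(np.sign(theta_bar[J]) * Phi[:, J], axis=1)
    errs.append(np.mean((f - U_bar) ** 2))               # (1/n)|f - U_bar|_2^2
print("average MSE of k-sparse approximation:", np.mean(errs))
print("Maurey bound D^2 R^2 / k            :", 1.0 / k)
```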

Maurey’s argument implies the following corollary.

Corollary 3.7. Assume that the assumptions of Theorem 3.4 hold and that the dictionary $\{\varphi_1,\dots,\varphi_M\}$ is normalized in such a way that
\[
\max_{1\le j\le M}|\varphi_j|_2 \le \sqrt n .
\]
Then there exists a constant $C>0$ such that the BIC estimator satisfies
\[
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{bic}}}) \le \inf_{\theta\in\mathbb{R}^M}\Big\{2\,\mathrm{MSE}(\varphi_\theta) + C\Big[\frac{\sigma^2|\theta|_0\log(eM)}{n}\wedge\sigma|\theta|_1\sqrt{\frac{\log(eM)}{n}}\Big]\Big\} + C\,\frac{\sigma^2\log(1/\delta)}{n}
\]
with probability at least $1-\delta$.

Proof. Choosing $\alpha=1/3$ in Theorem 3.4 yields
\[
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{bic}}}) \le 2\inf_{\theta\in\mathbb{R}^M}\Big\{\mathrm{MSE}(\varphi_\theta) + C\,\frac{\sigma^2|\theta|_0\log(eM)}{n}\Big\} + C\,\frac{\sigma^2\log(1/\delta)}{n} .
\]
Let $\theta'\in\mathbb{R}^M$. It follows from Maurey's argument that for any $k\in[M]$, there exists $\theta=\theta(\theta',k)\in\mathbb{R}^M$ such that $|\theta|_0=k$ and
\[
\mathrm{MSE}(\varphi_\theta) \le \mathrm{MSE}(\varphi_{\theta'}) + \frac{2|\theta'|_1^2}{k} .
\]
It implies that
\[
\mathrm{MSE}(\varphi_\theta) + C\,\frac{\sigma^2|\theta|_0\log(eM)}{n} \le \mathrm{MSE}(\varphi_{\theta'}) + \frac{2|\theta'|_1^2}{k} + C\,\frac{\sigma^2k\log(eM)}{n}
\]
and furthermore that
\[
\inf_{\theta\in\mathbb{R}^M}\Big\{\mathrm{MSE}(\varphi_\theta) + C\,\frac{\sigma^2|\theta|_0\log(eM)}{n}\Big\} \le \mathrm{MSE}(\varphi_{\theta'}) + \frac{2|\theta'|_1^2}{k} + C\,\frac{\sigma^2k\log(eM)}{n} .
\]
Since the above bound holds for any $\theta'\in\mathbb{R}^M$ and $k\in[M]$, we can take an infimum with respect to both $\theta'$ and $k$ on the right-hand side to get
\[
\inf_{\theta\in\mathbb{R}^M}\Big\{\mathrm{MSE}(\varphi_\theta) + C\,\frac{\sigma^2|\theta|_0\log(eM)}{n}\Big\} \le \inf_{\theta'\in\mathbb{R}^M}\Big\{\mathrm{MSE}(\varphi_{\theta'}) + C\min_k\Big(\frac{|\theta'|_1^2}{k} + C\,\frac{\sigma^2k\log(eM)}{n}\Big)\Big\} .
\]
To control the minimum over $k$, we need to consider three cases for the quantity
\[
\bar k = \frac{|\theta'|_1}{\sigma}\sqrt{\frac{n}{\log(eM)}} .
\]

1. If $1\le\bar k\le M$, then we get
\[
\min_k\Big(\frac{|\theta'|_1^2}{k} + C\,\frac{\sigma^2k\log(eM)}{n}\Big) \le C\sigma|\theta'|_1\sqrt{\frac{\log(eM)}{n}} .
\]

2. If $\bar k\le1$, then
\[
|\theta'|_1^2 \le C\,\frac{\sigma^2\log(eM)}{n} ,
\]
which yields
\[
\min_k\Big(\frac{|\theta'|_1^2}{k} + C\,\frac{\sigma^2k\log(eM)}{n}\Big) \le C\,\frac{\sigma^2\log(eM)}{n} .
\]

3. If $\bar k\ge M$, then
\[
\frac{\sigma^2M\log(eM)}{n} \le C\,\frac{|\theta'|_1^2}{M} .
\]
Therefore, on the one hand, if $M\ge\frac{|\theta'|_1}{\sigma}\sqrt{\frac{n}{\log(eM)}}$, we get
\[
\min_k\Big(\frac{|\theta'|_1^2}{k} + C\,\frac{\sigma^2k\log(eM)}{n}\Big) \le C\,\frac{|\theta'|_1^2}{M} \le C\sigma|\theta'|_1\sqrt{\frac{\log(eM)}{n}} .
\]
On the other hand, if $M\le\frac{|\theta'|_1}{\sigma}\sqrt{\frac{n}{\log(eM)}}$, then for any $\theta\in\mathbb{R}^M$, we have
\[
\frac{\sigma^2|\theta|_0\log(eM)}{n} \le \frac{\sigma^2M\log(eM)}{n} \le C\sigma|\theta'|_1\sqrt{\frac{\log(eM)}{n}} .
\]
Combined,
\[
\inf_{\theta'\in\mathbb{R}^M}\Big\{\mathrm{MSE}(\varphi_{\theta'}) + C\min_k\Big(\frac{|\theta'|_1^2}{k} + C\,\frac{\sigma^2k\log(eM)}{n}\Big)\Big\}
\le \inf_{\theta'\in\mathbb{R}^M}\Big\{\mathrm{MSE}(\varphi_{\theta'}) + C\sigma|\theta'|_1\sqrt{\frac{\log(eM)}{n}} + C\,\frac{\sigma^2\log(eM)}{n}\Big\} ,
\]
which together with Theorem 3.4 yields the claim.

Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the $\ell_0$ norm as in Theorem 3.4. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of [RT11].

Maurey's argument allows us to enjoy the best of both the $\ell_0$ and the $\ell_1$ world. The rate adapts to the sparsity of the problem and can even be generalized to $\ell_q$-sparsity (see Problem 3.3). However, it is clear from the proof that this argument is limited to squared $\ell_2$ norms such as the one appearing in the MSE, and extension to other risk measures is nontrivial. Some work has been done for non-Hilbert spaces [Pis81, DDGS97] using more sophisticated arguments.

3.2 NONPARAMETRIC REGRESSION

So far, the oracle inequalities that we have derived do not deal with the approximation error $\mathrm{MSE}(\varphi_\theta)$. We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions $\mathcal H=\{\varphi_1,\dots,\varphi_M\}$ is chosen appropriately. In what follows, we assume for simplicity that $d=1$, so that $f:\mathbb{R}\to\mathbb{R}$ and $\varphi_j:\mathbb{R}\to\mathbb{R}$.

Fourier decomposition

Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary $\mathcal H=\{\varphi_1,\dots,\varphi_M\}$ forms an orthonormal system of $L_2([0,1])$:
\[
\int_0^1\varphi_j^2(x)\,dx = 1 , \qquad \int_0^1\varphi_j(x)\varphi_k(x)\,dx = 0 , \quad \forall\,j\ne k .
\]
We will also deal with the case where $M=\infty$.

When $\mathcal H$ is an orthonormal system, the coefficients $\theta_j^*\in\mathbb{R}$ defined by
\[
\theta_j^* = \int_0^1f(x)\varphi_j(x)\,dx
\]
are called the Fourier coefficients of $f$.


Assume now that the regression function $f$ admits the following decomposition:
\[
f = \sum_{j=1}^\infty\theta_j^*\varphi_j .
\]
There exist many choices for the orthonormal system, and we give only two as examples.

Example 3.8. Trigonometric basis. This is an orthonormal basis of $L_2([0,1])$. It is defined by
\begin{align*}
\varphi_1 &\equiv 1 ,\\
\varphi_{2k}(x) &= \sqrt2\cos(2\pi kx) ,\\
\varphi_{2k+1}(x) &= \sqrt2\sin(2\pi kx) ,
\end{align*}
for $k=1,2,\dots$ and $x\in[0,1]$. The fact that it is indeed an orthonormal system can be easily checked using trigonometric identities.

The next example has received a lot of attention in the signal (sound, image,. . . ) processing community.

Example 3.9. Wavelets. Let $\psi:\mathbb{R}\to\mathbb{R}$ be a sufficiently smooth and compactly supported function, called the "mother wavelet". Define the system of functions
\[
\psi_{jk}(x) = 2^{j/2}\psi(2^jx-k) , \qquad j,k\in\mathbb{Z} .
\]
It can be shown that for a suitable $\psi$, the dictionary $\{\psi_{jk},\ j,k\in\mathbb{Z}\}$ forms an orthonormal system of $L_2([0,1])$ and sometimes a basis. In the latter case, for any function $g\in L_2([0,1])$, it holds
\[
g = \sum_{j=-\infty}^\infty\sum_{k=-\infty}^\infty\theta_{jk}\psi_{jk} , \qquad \theta_{jk} = \int_0^1g(x)\psi_{jk}(x)\,dx .
\]
The coefficients $\theta_{jk}$ are called the wavelet coefficients of $g$.

The simplest example is given by the Haar system, obtained by taking $\psi$ to be the following piecewise constant function (see Figure 3.1):
\[
\psi(x) = \begin{cases}1 & 0\le x<1/2 ,\\ -1 & 1/2\le x\le1 ,\\ 0 & \text{otherwise} .\end{cases}
\]
We will not give more details about wavelets here but simply point the interested reader to [Mal09].
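A minimal sketch (not from the notes) of the Haar system: it evaluates $\psi_{jk}(x)=2^{j/2}\psi(2^jx-k)$ and checks orthonormality numerically with a Riemann sum; the grid size and the particular indices $(j,k)$ are assumptions, and the check is only approximate.

```python
import numpy as np

def haar_mother(x):
    """Haar mother wavelet: 1 on [0, 1/2), -1 on [1/2, 1], 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return (np.where((x >= 0) & (x < 0.5), 1.0, 0.0)
            - np.where((x >= 0.5) & (x <= 1.0), 1.0, 0.0))

def psi_jk(x, j, k):
    """Rescaled/translated wavelet psi_{jk}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2 ** (j / 2) * haar_mother(2 ** j * np.asarray(x) - k)

# Approximate L2([0,1]) inner products on a fine grid (Riemann sum).
x = np.linspace(0, 1, 100001)
dx = x[1] - x[0]
print("||psi_{2,1}||^2       ~", np.sum(psi_jk(x, 2, 1) ** 2) * dx)                 # ~ 1
print("<psi_{2,1}, psi_{2,2}> ~", np.sum(psi_jk(x, 2, 1) * psi_jk(x, 2, 2)) * dx)   # ~ 0
```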

Sobolev classes and ellipsoids

We begin by describing a class of smooth functions where smoothness is understood in terms of the number of derivatives. Recall that $f^{(k)}$ denotes the $k$-th derivative of $f$.

Figure 3.1. The Haar mother wavelet.

Definition 3.10. Fix parameters $\beta\in\{1,2,\dots\}$ and $L>0$. The Sobolev class of functions $W(\beta,L)$ is defined by
\[
W(\beta,L) = \Big\{f:[0,1]\to\mathbb{R}\ :\ f\in L_2([0,1]),\ f^{(\beta-1)}\text{ is absolutely continuous},\ \int_0^1\big[f^{(\beta)}\big]^2\le L^2,\ f^{(j)}(0)=f^{(j)}(1),\ j=0,\dots,\beta-1\Big\} .
\]

Any function $f\in W(\beta,L)$ can be represented¹ as its Fourier expansion along the trigonometric basis:
\[
f(x) = \theta_1^*\varphi_1(x) + \sum_{k=1}^\infty\big(\theta_{2k}^*\varphi_{2k}(x)+\theta_{2k+1}^*\varphi_{2k+1}(x)\big) , \qquad \forall\,x\in[0,1] ,
\]
where $\theta^*=\{\theta_j^*\}_{j\ge1}$ is in the space of squared summable sequences $\ell_2(\mathbb{N})$ defined by
\[
\ell_2(\mathbb{N}) = \Big\{\theta\ :\ \sum_{j=1}^\infty\theta_j^2<\infty\Big\} .
\]
For any $\beta>0$, define the coefficients
\[
a_j = \begin{cases}j^\beta & \text{for } j \text{ even},\\ (j-1)^\beta & \text{for } j \text{ odd}.\end{cases} \qquad (3.9)
\]
With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.

¹In the sense that
\[
\lim_{k\to\infty}\int_0^1\Big|f(t)-\sum_{j=1}^k\theta_j\varphi_j(t)\Big|^2dt = 0 .
\]


Theorem 3.11. Fix $\beta\ge1$ and $L>0$ and let $\{\varphi_j\}_{j\ge1}$ denote the trigonometric basis of $L_2([0,1])$. Moreover, let $\{a_j\}_{j\ge1}$ be defined as in (3.9). A function $f\in W(\beta,L)$ can be represented as
\[
f = \sum_{j=1}^\infty\theta_j^*\varphi_j ,
\]
where the sequence $\{\theta_j^*\}_{j\ge1}$ belongs to the Sobolev ellipsoid of $\ell_2(\mathbb{N})$ defined by
\[
\Theta(\beta,Q) = \Big\{\theta\in\ell_2(\mathbb{N})\ :\ \sum_{j=1}^\infty a_j^2\theta_j^2\le Q\Big\}
\]
for $Q=L^2/\pi^{2\beta}$.

Proof. Let us first define the Fourier coefficients $\{s_k(j)\}_{k\ge1}$ of the $j$th derivative $f^{(j)}$ of $f$ for $j=1,\dots,\beta$:
\begin{align*}
s_1(j) &= \int_0^1f^{(j)}(t)\,dt = f^{(j-1)}(1)-f^{(j-1)}(0) = 0 ,\\
s_{2k}(j) &= \sqrt2\int_0^1f^{(j)}(t)\cos(2\pi kt)\,dt ,\\
s_{2k+1}(j) &= \sqrt2\int_0^1f^{(j)}(t)\sin(2\pi kt)\,dt .
\end{align*}
The Fourier coefficients of $f$ are given by $\theta_k=s_k(0)$.

Using integration by parts, we find that
\begin{align*}
s_{2k}(\beta) &= \sqrt2\,f^{(\beta-1)}(t)\cos(2\pi kt)\Big|_0^1 + (2\pi k)\sqrt2\int_0^1f^{(\beta-1)}(t)\sin(2\pi kt)\,dt\\
&= \sqrt2\big[f^{(\beta-1)}(1)-f^{(\beta-1)}(0)\big] + (2\pi k)\sqrt2\int_0^1f^{(\beta-1)}(t)\sin(2\pi kt)\,dt\\
&= (2\pi k)s_{2k+1}(\beta-1) .
\end{align*}
Moreover,
\begin{align*}
s_{2k+1}(\beta) &= \sqrt2\,f^{(\beta-1)}(t)\sin(2\pi kt)\Big|_0^1 - (2\pi k)\sqrt2\int_0^1f^{(\beta-1)}(t)\cos(2\pi kt)\,dt\\
&= -(2\pi k)s_{2k}(\beta-1) .
\end{align*}
In particular, it yields
\[
s_{2k}(\beta)^2 + s_{2k+1}(\beta)^2 = (2\pi k)^2\big[s_{2k}(\beta-1)^2+s_{2k+1}(\beta-1)^2\big] .
\]
By induction, we find that for any $k\ge1$,
\[
s_{2k}(\beta)^2 + s_{2k+1}(\beta)^2 = (2\pi k)^{2\beta}\big(\theta_{2k}^2+\theta_{2k+1}^2\big) .
\]
Next, it follows from the definition (3.9) of $a_j$ that
\[
\sum_{k=1}^\infty(2\pi k)^{2\beta}\big(\theta_{2k}^2+\theta_{2k+1}^2\big) = \pi^{2\beta}\sum_{k=1}^\infty a_{2k}^2\theta_{2k}^2 + \pi^{2\beta}\sum_{k=1}^\infty a_{2k+1}^2\theta_{2k+1}^2 = \pi^{2\beta}\sum_{j=1}^\infty a_j^2\theta_j^2 .
\]
Together with the Parseval identity, it yields
\[
\int_0^1\big(f^{(\beta)}(t)\big)^2dt = \sum_{k=1}^\infty\big[s_{2k}(\beta)^2+s_{2k+1}(\beta)^2\big] = \pi^{2\beta}\sum_{j=1}^\infty a_j^2\theta_j^2 .
\]
To conclude, observe that since $f\in W(\beta,L)$, we have
\[
\int_0^1\big(f^{(\beta)}(t)\big)^2dt \le L^2 ,
\]
so that $\theta\in\Theta(\beta,L^2/\pi^{2\beta})$.

It can actually be shown that the converse is true, that is, any function with Fourier coefficients in $\Theta(\beta,Q)$ belongs to $W(\beta,L)$, but we will not be needing this.

In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write $f\in\Theta(\beta,Q)$ in this case and consider any real value for $\beta$.

Proposition 3.12. The Sobolev ellipsoids enjoy the following properties:

(i) For any $Q>0$,
\[
0<\beta'<\beta \ \Rightarrow\ \Theta(\beta,Q)\subset\Theta(\beta',Q) .
\]

(ii) For any $Q>0$, if $\beta>\frac12$ then $f$ is continuous.

The proof is left as an exercise (Problem 3.6).

It turns out that the first $n$ functions in the trigonometric basis are not only orthonormal with respect to the inner product of $L_2$, but also with respect to the inner product associated with the fixed design, $\langle f,g\rangle := \frac1n\sum_{i=1}^nf(X_i)g(X_i)$, when the design is chosen to be regular, i.e., $X_i=(i-1)/n$, $i=1,\dots,n$.

Lemma 3.13. Assume that $\{X_1,\dots,X_n\}$ is the regular design, i.e., $X_i=(i-1)/n$. Then, for any $M\le n-1$, the design matrix $\Phi=\{\varphi_j(X_i)\}_{1\le i\le n,\,1\le j\le M}$ satisfies the ORT condition.


Proof. Fix $j,j'\in\{1,\dots,n-1\}$, $j\ne j'$, and consider the inner product $\varphi_j^\top\varphi_{j'}$. Write $k_j=\lfloor j/2\rfloor$ for the integer part of $j/2$ and define the vectors $a,b,a',b'\in\mathbb{R}^n$ with coordinates such that $e^{\frac{i2\pi k_js}{n}}=a_{s+1}+ib_{s+1}$ and $e^{\frac{i2\pi k_{j'}s}{n}}=a'_{s+1}+ib'_{s+1}$ for $s\in\{0,\dots,n-1\}$. It holds that
\[
\frac12\varphi_j^\top\varphi_{j'} \in \{a^\top a',\ b^\top b',\ b^\top a',\ a^\top b'\} ,
\]
depending on the parity of $j$ and $j'$.

On the one hand, observe that if $k_j\ne k_{j'}$, we have for any $\sigma\in\{-1,+1\}$,
\[
\sum_{s=0}^{n-1}e^{\frac{i2\pi k_js}{n}}e^{\sigma\frac{i2\pi k_{j'}s}{n}} = \sum_{s=0}^{n-1}e^{\frac{i2\pi(k_j+\sigma k_{j'})s}{n}} = 0 .
\]
On the other hand,
\[
\sum_{s=0}^{n-1}e^{\frac{i2\pi k_js}{n}}e^{\sigma\frac{i2\pi k_{j'}s}{n}} = (a+ib)^\top(a'+\sigma ib') = a^\top a'-\sigma b^\top b' + i\big[b^\top a'+\sigma a^\top b'\big] ,
\]
so that $a^\top a'=\pm b^\top b'=0$ and $b^\top a'=\pm a^\top b'=0$ whenever $k_j\ne k_{j'}$. It yields $\varphi_j^\top\varphi_{j'}=0$.

Next, consider the case where $k_j=k_{j'}$, so that
\[
\sum_{s=0}^{n-1}e^{\frac{i2\pi k_js}{n}}e^{\sigma\frac{i2\pi k_{j'}s}{n}} = \begin{cases}0 & \text{if }\sigma=1 ,\\ n & \text{if }\sigma=-1 .\end{cases}
\]
On the one hand, if $j\ne j'$, it can only be the case that $\varphi_j^\top\varphi_{j'}\in\{b^\top a',\ a^\top b'\}$, but the same argument as above yields $b^\top a'=\pm a^\top b'=0$ since the imaginary part of the inner product is still $0$. Hence, in that case, $\varphi_j^\top\varphi_{j'}=0$. On the other hand, if $j=j'$, then $a=a'$ and $b=b'$, and the two identities above give $|a|_2^2=|b|_2^2=n/2$, which is equivalent to $\varphi_j^\top\varphi_j=|\varphi_j|_2^2=n$. Therefore, the design matrix $\Phi$ is such that
\[
\Phi^\top\Phi = nI_M .
\]
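Lemma 3.13 is easy to verify numerically. The following sketch (not from the notes) builds the trigonometric design matrix on the regular design and checks $\Phi^\top\Phi=nI_M$; the values of $n$ and $M$ are assumptions.

```python
import numpy as np

def trig_basis(x, M):
    """First M functions of the trigonometric basis of Example 3.8, evaluated at points x."""
    x = np.asarray(x, dtype=float)
    Phi = np.empty((x.size, M))
    Phi[:, 0] = 1.0
    for j in range(2, M + 1):
        k = j // 2
        if j % 2 == 0:
            Phi[:, j - 1] = np.sqrt(2) * np.cos(2 * np.pi * k * x)
        else:
            Phi[:, j - 1] = np.sqrt(2) * np.sin(2 * np.pi * k * x)
    return Phi

n, M = 64, 63                        # regular design, M <= n - 1
X = np.arange(n) / n                 # X_i = (i - 1)/n
Phi = trig_basis(X, M)
print(np.allclose(Phi.T @ Phi, n * np.eye(M), atol=1e-8))   # ORT: Phi^T Phi = n I_M
```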

Integrated squared error

As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if $\theta\in\Theta(\beta,Q)$, then $a_j^2\theta_j^2\to0$ as $j\to\infty$, so that $|\theta_j|=o(j^{-\beta})$. Therefore, the $\theta_j$'s decay polynomially to zero and it makes sense to approximate $f$ by its truncated Fourier series
\[
\sum_{j=1}^M\theta_j^*\varphi_j =: \varphi^M_{\theta^*}
\]


for any fixed M . This truncation leads to a systematic error that vanishes asM →∞. We are interested in understanding the rate at which this happens.

The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter $M$ and the smoothness $\beta$.

Lemma 3.14. For any integer $M\ge1$ and $f\in\Theta(\beta,Q)$, $\beta>1/2$, it holds
\[
\|\varphi^M_{\theta^*}-f\|_{L_2}^2 = \sum_{j>M}|\theta_j^*|^2 \le QM^{-2\beta} , \qquad (3.10)
\]
and for $M=n-1$, we have
\[
|\varphi^{n-1}_{\theta^*}-f|_2^2 \le 2n\Big(\sum_{j\ge n}|\theta_j^*|\Big)^2 \lesssim Qn^{2-2\beta} . \qquad (3.11)
\]

Proof. Note that for any $\theta\in\Theta(\beta,Q)$, if $\beta>1/2$, then
\[
\sum_{j=2}^\infty|\theta_j| = \sum_{j=2}^\infty a_j|\theta_j|\frac1{a_j} \le \sqrt{\sum_{j=2}^\infty a_j^2\theta_j^2}\,\sqrt{\sum_{j=2}^\infty\frac1{a_j^2}} \le \sqrt{Q\sum_{j=1}^\infty\frac1{j^{2\beta}}} < \infty
\]
by Cauchy-Schwarz.

Since $\{\varphi_j\}_j$ forms an orthonormal system in $L_2([0,1])$, we have
\[
\min_{\theta\in\mathbb{R}^M}\|\varphi_\theta-f\|_{L_2}^2 = \|\varphi^M_{\theta^*}-f\|_{L_2}^2 = \sum_{j>M}|\theta_j^*|^2 .
\]
When $\theta^*\in\Theta(\beta,Q)$, we have
\[
\sum_{j>M}|\theta_j^*|^2 = \sum_{j>M}a_j^2|\theta_j^*|^2\frac1{a_j^2} \le \frac{Q}{a_{M+1}^2} \le \frac{Q}{M^{2\beta}} .
\]
To prove the second part of the lemma, observe that
\[
|\varphi^{n-1}_{\theta^*}-f|_2 = \Big|\sum_{j\ge n}\theta_j^*\varphi_j\Big|_2 \le \sqrt{2n}\sum_{j\ge n}|\theta_j^*| ,
\]
where in the last inequality we used the fact that, for the trigonometric basis, $|\varphi_j|_2\le\sqrt{2n}$, $j\ge1$, regardless of the choice of the design $X_1,\dots,X_n$. When $\theta^*\in\Theta(\beta,Q)$, we have
\[
\sum_{j\ge n}|\theta_j^*| = \sum_{j\ge n}a_j|\theta_j^*|\frac1{a_j} \le \sqrt{\sum_{j\ge n}a_j^2|\theta_j^*|^2}\,\sqrt{\sum_{j\ge n}\frac1{a_j^2}} \lesssim \sqrt Q\,n^{\frac12-\beta} .
\]


Note that the truncated Fourier series $\varphi^M_{\theta^*}$ is an oracle: this is what we see when we view $f$ through the lens of functions with only low-frequency harmonics.

To estimate $\varphi_{\theta^*}=\varphi^M_{\theta^*}$, consider the estimator $\varphi_{\hat\theta^{\mathrm{ls}}}$ where
\[
\hat\theta^{\mathrm{ls}} \in \operatorname*{argmin}_{\theta\in\mathbb{R}^M}\sum_{i=1}^n\big(Y_i-\varphi_\theta(X_i)\big)^2 ,
\]
which should be such that $\varphi_{\hat\theta^{\mathrm{ls}}}$ is close to $\varphi_{\theta^*}$. For this estimator, we have proved (Theorem 3.3) an oracle inequality for the MSE that is of the form
\[
|\varphi^M_{\hat\theta^{\mathrm{ls}}}-f|_2^2 \le \inf_{\theta\in\mathbb{R}^M}|\varphi^M_\theta-f|_2^2 + C\sigma^2M\log(1/\delta) , \qquad C>0 .
\]
It yields
\begin{align*}
|\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*}|_2^2 &\le 2(\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*})^\top(f-\varphi^M_{\theta^*}) + C\sigma^2M\log(1/\delta)\\
&= 2(\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*})^\top\Big(\sum_{j>M}\theta_j^*\varphi_j\Big) + C\sigma^2M\log(1/\delta)\\
&= 2(\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*})^\top\Big(\sum_{j\ge n}\theta_j^*\varphi_j\Big) + C\sigma^2M\log(1/\delta) ,
\end{align*}
where we used Lemma 3.13 in the last equality. Together with (3.11) and Young's inequality $2ab\le\alpha a^2+b^2/\alpha$, $a,b\ge0$, valid for any $\alpha>0$, we get
\[
2(\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*})^\top\Big(\sum_{j\ge n}\theta_j^*\varphi_j\Big) \le \alpha|\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*}|_2^2 + \frac C\alpha Qn^{2-2\beta}
\]
for some positive constant $C$ when $\theta^*\in\Theta(\beta,Q)$. As a result,
\[
|\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*}|_2^2 \lesssim \frac1{\alpha(1-\alpha)}Qn^{2-2\beta} + \frac{\sigma^2M}{1-\alpha}\log(1/\delta) \qquad (3.12)
\]
for any $\alpha\in(0,1)$. Since Lemma 3.13 implies $|\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*}|_2^2 = n\|\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*}\|_{L_2([0,1])}^2$, we have proved the following theorem.

Theorem 3.15. Fix $\beta\ge(1+\sqrt5)/4\simeq0.81$, $Q>0$, $\delta>0$ and assume the general regression model (3.1) with $f\in\Theta(\beta,Q)$ and $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$, $\sigma^2\le1$. Moreover, let $M=\lceil n^{\frac1{2\beta+1}}\rceil$ and let $n$ be large enough so that $M\le n-1$. Then the least squares estimator $\hat\theta^{\mathrm{ls}}$ defined in (3.2), with $\{\varphi_j\}_{j=1}^M$ being the trigonometric basis, satisfies with probability $1-\delta$, for $n$ large enough,
\[
\|\varphi^M_{\hat\theta^{\mathrm{ls}}}-f\|_{L_2([0,1])}^2 \lesssim n^{-\frac{2\beta}{2\beta+1}}\big(1+\sigma^2\log(1/\delta)\big) ,
\]
where the constant factors may depend on $\beta$, $Q$ and $\sigma$.


Proof. Choosing $\alpha=1/2$, for example, and absorbing $Q$ in the constants, we get from (3.12) and Lemma 3.13 that for $M\le n-1$,
\[
\|\varphi^M_{\hat\theta^{\mathrm{ls}}}-\varphi^M_{\theta^*}\|_{L_2([0,1])}^2 \lesssim n^{1-2\beta} + \sigma^2\,\frac{M\log(1/\delta)}{n} .
\]
Using now Lemma 3.14 and $\sigma^2\le1$, we get
\[
\|\varphi^M_{\hat\theta^{\mathrm{ls}}}-f\|_{L_2([0,1])}^2 \lesssim M^{-2\beta} + n^{1-2\beta} + \sigma^2\,\frac{M\log(1/\delta)}{n} .
\]
Taking $M=\lceil n^{\frac1{2\beta+1}}\rceil\le n-1$ for $n$ large enough yields
\[
\|\varphi^M_{\hat\theta^{\mathrm{ls}}}-f\|_{L_2([0,1])}^2 \lesssim n^{-\frac{2\beta}{2\beta+1}} + n^{1-2\beta}\sigma^2\log(1/\delta) .
\]
To conclude the proof, simply note that for the prescribed $\beta$, we have $n^{1-2\beta}\le n^{-\frac{2\beta}{2\beta+1}}$.
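The projection estimator of Theorem 3.15 amounts to keeping the first $M$ empirical Fourier coefficients, since $\Phi^\top\Phi=nI_M$ on the regular design. The sketch below (not from the notes) illustrates this; the target function, noise level, and the assumed smoothness $\beta$ used to set $M$ are arbitrary illustrative choices.

```python
import numpy as np

def trig_design(x, M):
    """n x M design matrix of the trigonometric basis at points x (Example 3.8)."""
    cols = [np.ones_like(x)]
    for j in range(2, M + 1):
        k = j // 2
        cols.append(np.sqrt(2) * (np.cos if j % 2 == 0 else np.sin)(2 * np.pi * k * x))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
n, sigma, beta = 512, 0.5, 1.0
X = np.arange(n) / n                                   # regular design
f = np.abs(X - 0.5)                                    # an arbitrary periodic target (assumption)
Y = f + sigma * rng.standard_normal(n)

M = int(np.ceil(n ** (1 / (2 * beta + 1))))            # truncation level from Theorem 3.15
Phi = trig_design(X, M)
theta_ls = Phi.T @ Y / n                               # least squares, using Phi^T Phi = n I_M (Lemma 3.13)
f_hat = Phi @ theta_ls
print("M =", M, " empirical MSE =", np.mean((f_hat - f) ** 2))
```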

Adaptive estimation

The rate attained by the projection estimator $\varphi_{\hat\theta^{\mathrm{ls}}}$ with $M=\lceil n^{\frac1{2\beta+1}}\rceil$ is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter $\beta$, which is typically unknown, in order to determine the truncation level $M$. The purpose of adaptive estimation is precisely to adapt to the unknown $\beta$, that is, to build an estimator that does not depend on $\beta$ and yet attains a rate of the order of $Cn^{-\frac{2\beta}{2\beta+1}}$ (up to a logarithmic slowdown). To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in (3.3) and (3.4), respectively. In view of Lemma 3.13, the design matrix $\Phi$ actually satisfies the assumption ORT when we work with the trigonometric basis. This has two useful implications:

1. Both estimators are actually thresholding estimators and can therefore be implemented efficiently.

2. The condition INC(k) is automatically satisfied for any $k\ge1$.

These observations lead to the following corollary.

Corollary 3.16. Fix $\beta>(1+\sqrt5)/4\simeq0.81$, $Q>0$, $\delta>0$ and $n$ large enough to ensure $n-1\ge\lceil n^{\frac1{2\beta+1}}\rceil$, and assume the general regression model (3.1) with $f\in\Theta(\beta,Q)$ and $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$, $\sigma^2\le1$. Let $\{\varphi_j\}_{j=1}^{n-1}$ be the trigonometric basis. Denote by $\varphi^{n-1}_{\hat\theta^{\mathrm{bic}}}$ (resp. $\varphi^{n-1}_{\hat\theta^{\mathrm L}}$) the BIC (resp. Lasso) estimator defined in (3.3) (resp. (3.4)) over $\mathbb{R}^{n-1}$ with regularization parameter given by (3.5) (resp. (3.6)). Then $\varphi^{n-1}_{\hat\theta}$, where $\hat\theta\in\{\hat\theta^{\mathrm{bic}},\hat\theta^{\mathrm L}\}$, satisfies with probability $1-\delta$,
\[
\|\varphi^{n-1}_{\hat\theta}-f\|_{L_2([0,1])}^2 \lesssim (n/\log n)^{-\frac{2\beta}{2\beta+1}}\big(1+\sigma^2\log(1/\delta)\big) ,
\]
where the constants may depend on $\beta$ and $Q$.


Proof. For $\hat\theta\in\{\hat\theta^{\mathrm{bic}},\hat\theta^{\mathrm L}\}$, adapting the proofs of Theorem 3.4 for the BIC estimator and Theorem 3.5 for the Lasso estimator, for any $\theta\in\mathbb{R}^{n-1}$, with probability $1-\delta$,
\[
|\varphi^{n-1}_{\hat\theta}-f|_2^2 \le \frac{1+\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 + R(|\theta|_0) ,
\]
where
\[
R(|\theta|_0) := \frac{C\sigma^2}{\alpha(1-\alpha)}|\theta|_0\log(en/\delta) .
\]
It yields
\begin{align*}
|\varphi^{n-1}_{\hat\theta}-\varphi^{n-1}_\theta|_2^2 &\le \frac{2\alpha}{1-\alpha}|\varphi^{n-1}_\theta-f|_2^2 + 2(\varphi^{n-1}_{\hat\theta}-\varphi^{n-1}_\theta)^\top(f-\varphi^{n-1}_\theta) + R(|\theta|_0)\\
&\le \Big(\frac{2\alpha}{1-\alpha}+\frac1\alpha\Big)|\varphi^{n-1}_\theta-f|_2^2 + \alpha|\varphi^{n-1}_{\hat\theta}-\varphi^{n-1}_\theta|_2^2 + R(|\theta|_0) ,
\end{align*}
where we used Young's inequality once again. Let $M\in\mathbb{N}$ be determined later and choose now $\alpha=1/2$ and $\theta=\theta^*_M$, where $\theta^*_M$ is equal to $\theta^*$ on its first $M$ coordinates and $0$ otherwise, so that $\varphi^{n-1}_{\theta^*_M}=\varphi^M_{\theta^*}$. It yields
\[
|\varphi^{n-1}_{\hat\theta}-\varphi^{n-1}_{\theta^*_M}|_2^2 \lesssim |\varphi^{n-1}_{\theta^*_M}-f|_2^2 + R(M) \lesssim |\varphi^{n-1}_{\theta^*_M}-\varphi^{n-1}_{\theta^*}|_2^2 + |\varphi^{n-1}_{\theta^*}-f|_2^2 + R(M) .
\]
Next, it follows from (3.11) that $|\varphi^{n-1}_{\theta^*}-f|_2^2 \lesssim Qn^{2-2\beta}$. Together with Lemma 3.13, it yields
\[
\|\varphi^{n-1}_{\hat\theta}-\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2 \lesssim \|\varphi^{n-1}_{\theta^*}-\varphi^{n-1}_{\theta^*_M}\|_{L_2([0,1])}^2 + Qn^{1-2\beta} + \frac{R(M)}{n} .
\]
Moreover, using (3.10), we find that
\[
\|\varphi^{n-1}_{\hat\theta}-f\|_{L_2([0,1])}^2 \lesssim M^{-2\beta} + Qn^{1-2\beta} + \frac Mn\sigma^2\log(en/\delta) .
\]
To conclude the proof, choose $M=\lceil(n/\log n)^{\frac1{2\beta+1}}\rceil$ and observe that the choice of $\beta$ ensures that $n^{1-2\beta}\lesssim M^{-2\beta}$.

While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see [Tsy09, Chapter 3]). The reason why we get this extra logarithmic factor here is that we use a hammer that's too big. Indeed, BIC and Lasso allow for "holes" in the Fourier decomposition, so we only get part of their potential benefits.


3.3 PROBLEM SET

Problem 3.1. Show that the least-squares estimator $\hat\theta^{\mathrm{ls}}$ defined in (3.2) satisfies the following exact oracle inequality:
\[
\mathbb{E}\,\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm{ls}}}) \le \inf_{\theta\in\mathbb{R}^M}\mathrm{MSE}(\varphi_\theta) + C\sigma^2\,\frac Mn
\]
for some constant $C$ to be specified.

Problem 3.2. Assume that $\varepsilon\sim\mathsf{subG}_n(\sigma^2)$ and that the vectors $\varphi_j$ are normalized in such a way that $\max_j|\varphi_j|_2\le\sqrt n$. Show that there exists a choice of $\tau$ such that the Lasso estimator $\hat\theta^{\mathrm L}$ with regularization parameter $2\tau$ satisfies the following exact oracle inequality:
\[
\mathrm{MSE}(\varphi_{\hat\theta^{\mathrm L}}) \le \inf_{\theta\in\mathbb{R}^M}\Big\{\mathrm{MSE}(\varphi_\theta) + C\sigma|\theta|_1\sqrt{\frac{\log M}{n}}\Big\}
\]
with probability at least $1-M^{-c}$ for some positive constants $C,c$.

Problem 3.3. Let $\{\varphi_1,\dots,\varphi_M\}$ be a dictionary normalized in such a way that $\max_j|\varphi_j|_2\le D\sqrt n$. Show that for any integer $k$ such that $1\le k\le M$, we have
\[
\min_{\substack{\theta\in\mathbb{R}^M\\ |\theta|_0\le 2k}}\mathrm{MSE}(\varphi_\theta) \le \min_{\substack{\theta\in\mathbb{R}^M\\ |\theta|_{w\ell_q}\le1}}\mathrm{MSE}(\varphi_\theta) + C_qD^2\,\frac{\big(M^{1/\bar q}-k^{1/\bar q}\big)^2}{k} ,
\]
where $|\theta|_{w\ell_q}$ denotes the weak $\ell_q$ norm and $\bar q$ is such that $\frac1q+\frac1{\bar q}=1$, for $q>1$.

Problem 3.4. Show that the trigonometric basis and the Haar system indeed form an orthonormal system of $L_2([0,1])$.

Problem 3.5. Consider the $n\times d$ random matrix $\Phi=\{\varphi_j(X_i)\}_{1\le i\le n,\,1\le j\le d}$ where $X_1,\dots,X_n$ are i.i.d. uniform random variables on the interval $[0,1]$ and $\varphi_j$ is the trigonometric basis as defined in Example 3.8. Show that $\Phi$ satisfies INC(k) with probability at least .9 as long as $n\ge Ck^2\log(d)$ for some large enough constant $C>0$.

Problem 3.6. Show that if $f\in\Theta(\beta,Q)$ for $\beta>1/2$ and $Q>0$, then $f$ is continuous.


Chapter 4

Minimax Lower Bounds

In the previous chapters, we have proved several upper bounds, and the goal of this chapter is to assess their optimality. Specifically, our goal is to answer the following questions:

1. Can our analysis be improved? In other words: do the estimators that we have studied actually satisfy better bounds?

2. Can any estimator improve upon these bounds?

Both questions ask about some form of optimality. The first one is about optimality of an estimator, whereas the second one is about optimality of a bound.

The difficulty of these questions varies depending on whether we are looking for a positive or a negative answer. Indeed, a positive answer to these questions simply consists in finding a better proof for the estimator we have studied (question 1.) or simply finding a better estimator, together with a proof that it performs better (question 2.). A negative answer is much more arduous. For example, in question 2., it is a statement about all estimators. How can this be done? The answer lies in information theory (see [CT06] for a nice introduction).

In this chapter, we will see how to give a negative answer to question 2. It will imply a negative answer to question 1.

4.1 OPTIMALITY IN A MINIMAX SENSE

Consider the Gaussian Sequence Model (GSM) where we observe $Y=(Y_1,\dots,Y_d)^\top$, defined by
\[
Y_i = \theta_i^* + \varepsilon_i , \qquad i=1,\dots,d , \qquad (4.1)
\]


    Θ          φ(Θ)                   Estimator          Result
    IR^d       σ²d/n                  θ̂^ls               Theorem 2.2
    B_1        σ √(log d / n)         θ̂^ls_{B_1}         Theorem 2.4
    B_0(k)     σ²k log(ed/k) / n      θ̂^ls_{B_0(k)}      Corollaries 2.8-2.9

    Table 4.1. Rate φ(Θ) obtained for different choices of Θ.

where ε = (ε_1, . . . , ε_d)^⊤ ∼ N_d(0, (σ²/n) I_d), θ* = (θ*_1, . . . , θ*_d)^⊤ ∈ Θ is the parameter of interest and Θ ⊂ IR^d is a given set of parameters. We will need a more precise notation for probabilities and expectations throughout this chapter. Denote by IP_{θ*} and IE_{θ*} the probability measure and corresponding expectation associated to the distribution of Y from the GSM (4.1).

Recall that the GSM is a special case of the linear regression model when the design matrix satisfies the ORT condition. In this case, we have proved several performance guarantees (upper bounds) for various choices of Θ that can be expressed either in the form

    IE[ |θ̂_n − θ*|_2² ] ≤ C φ(Θ)                                          (4.2)

or in the form

    |θ̂_n − θ*|_2² ≤ C φ(Θ) ,   with probability 1 − d^{−2} ,             (4.3)

for some constant C. The rates φ(Θ) for different choices of Θ that we have obtained are gathered in Table 4.1, together with the estimator (and the corresponding result from Chapter 2) that was employed to obtain each rate. Can any of these results be improved? In other words, does there exist another estimator θ̂ such that sup_{θ*∈Θ} IE|θ̂ − θ*|_2² ≪ φ(Θ)?

A first step in this direction is the Cramér-Rao lower bound [Sha03] that allows us to prove lower bounds in terms of the Fisher information. Nevertheless, this notion of optimality is too stringent and often leads to nonexistence of optimal estimators. Rather, we prefer here the notion of minimax optimality, which characterizes how fast θ* can be estimated uniformly over Θ.

Definition 4.1. We say that an estimator θ̂_n is minimax optimal over Θ if it satisfies (4.2) and there exists C′ > 0 such that

    inf_{θ̂} sup_{θ∈Θ} IE_θ |θ̂ − θ|_2² ≥ C′ φ(Θ)                           (4.4)

where the infimum is taken over all estimators (i.e., measurable functions of Y). Moreover, φ(Θ) is called the minimax rate of estimation over Θ.


Note that minimax rates of convergence φ are defined up to multiplicative constants. We may then choose this constant such that the minimax rate has a simple form such as σ²d/n as opposed to 7σ²d/n for example.

This definition can be adapted to rates that hold with high probability. As we saw in Chapter 2 (cf. Table 4.1), the upper bounds in expectation and those with high probability are of the same order of magnitude. It is also the case for lower bounds. Indeed, observe that it follows from the Markov inequality that for any A > 0,

    IE_θ[ φ^{-1}(Θ) |θ̂ − θ|_2² ] ≥ A IP_θ[ φ^{-1}(Θ) |θ̂ − θ|_2² > A ] .       (4.5)

Therefore, (4.4) follows if we prove that

    inf_{θ̂} sup_{θ∈Θ} IP_θ[ |θ̂ − θ|_2² > A φ(Θ) ] ≥ C

for some positive constants A and C. The above inequality also implies a lower bound with high probability. We can therefore employ the following alternative definition of minimax optimality.

Definition 4.2. We say that an estimator θ̂ is minimax optimal over Θ if it satisfies either (4.2) or (4.3) and there exists C′ > 0 such that

    inf_{θ̂} sup_{θ∈Θ} IP_θ[ |θ̂ − θ|_2² > φ(Θ) ] ≥ C′                        (4.6)

where the infimum is taken over all estimators (i.e., measurable functions of Y). Moreover, φ(Θ) is called the minimax rate of estimation over Θ.

4.2 REDUCTION TO FINITE HYPOTHESIS TESTING

Minimax lower bounds rely on information theory and follow from a simple principle: if the number of observations is too small, it may be hard to distinguish between two probability distributions that are close to each other. For example, given n i.i.d. observations, it is impossible to reliably decide whether they are drawn from N(0, 1) or N(1/n, 1). This simple argument can be made precise using the formalism of statistical hypothesis testing. To do so, we reduce our estimation problem to a testing problem. The reduction consists of two steps.

1. Reduction to a finite number of parameters. In this step the goal is to find the largest possible number of parameters θ_1, . . . , θ_M ∈ Θ under the constraint that

    |θ_j − θ_k|_2² ≥ 4 φ(Θ) .                                              (4.7)

This problem boils down to a packing of the set Θ.

Then we can use the following trivial observation:

    inf_{θ̂} sup_{θ∈Θ} IP_θ[ |θ̂ − θ|_2² > φ(Θ) ] ≥ inf_{θ̂} max_{1≤j≤M} IP_{θ_j}[ |θ̂ − θ_j|_2² > φ(Θ) ] .


2. Reduction to a hypothesis testing problem. In this second step, the necessity of the constraint (4.7) becomes apparent.

For any estimator θ̂, define the minimum distance test ψ(θ̂) that is associated to it by

    ψ(θ̂) = argmin_{1≤j≤M} |θ̂ − θ_j|_2 ,

with ties broken arbitrarily.

Next observe that if, for some j = 1, . . . , M, ψ(θ̂) ≠ j, then there exists k ≠ j such that |θ̂ − θ_k|_2 ≤ |θ̂ − θ_j|_2. Together with the reverse triangle inequality it yields

    |θ̂ − θ_j|_2 ≥ |θ_j − θ_k|_2 − |θ̂ − θ_k|_2 ≥ |θ_j − θ_k|_2 − |θ̂ − θ_j|_2 ,

so that

    |θ̂ − θ_j|_2 ≥ (1/2) |θ_j − θ_k|_2 .

Together with constraint (4.7), it yields

    |θ̂ − θ_j|_2² ≥ φ(Θ) .

As a result,

    inf_{θ̂} max_{1≤j≤M} IP_{θ_j}[ |θ̂ − θ_j|_2² > φ(Θ) ] ≥ inf_{θ̂} max_{1≤j≤M} IP_{θ_j}[ ψ(θ̂) ≠ j ] ≥ inf_{ψ} max_{1≤j≤M} IP_{θ_j}[ ψ ≠ j ] ,

where the infimum is taken over all tests ψ that are based on Y and take values in {1, . . . , M}.

Conclusion: to prove lower bounds, it is sufficient to find θ_1, . . . , θ_M ∈ Θ such that |θ_j − θ_k|_2² ≥ 4 φ(Θ) and

    inf_{ψ} max_{1≤j≤M} IP_{θ_j}[ ψ ≠ j ] ≥ C′ .

The above quantity is called the minimax probability of error. In the next sections, we show how it can be bounded from below using arguments from information theory. For the purpose of illustration, we begin with the simple case where M = 2 in the next section.

4.3 LOWER BOUNDS BASED ON TWO HYPOTHESES

The Neyman-Pearson Lemma and the total variation distance

Consider two probability measures IP_0 and IP_1 and an observation Z drawn from either IP_0 or IP_1. We want to know which distribution Z comes from. This corresponds to the following statistical hypothesis testing problem:

    H_0 : Z ∼ IP_0
    H_1 : Z ∼ IP_1


A test ψ = ψ(Z) ∈ {0, 1} indicates which hypothesis should be true. Any test ψ can make two types of errors. It can commit either an error of type I (ψ = 1 whereas Z ∼ IP_0) or an error of type II (ψ = 0 whereas Z ∼ IP_1). Of course, the test may also be correct. The following fundamental result, called the Neyman-Pearson Lemma, indicates that any test ψ is bound to commit one of these two types of error with positive probability unless IP_0 and IP_1 have essentially disjoint support.

Let ν be a sigma-finite measure satisfying IP_0 ≪ ν and IP_1 ≪ ν. For example we can take ν = IP_0 + IP_1. It follows from the Radon-Nikodym theorem [Bil95] that both IP_0 and IP_1 admit probability densities with respect to ν. We denote them by p_0 and p_1 respectively. For any function f, we write for simplicity

    ∫ f = ∫ f(x) ν(dx) .

Lemma 4.3 (Neyman-Pearson). Let IP_0 and IP_1 be two probability measures. Then for any test ψ, it holds

    IP_0(ψ = 1) + IP_1(ψ = 0) ≥ ∫ min(p_0, p_1) .

Moreover, equality holds for the likelihood ratio test ψ* = 1I(p_1 ≥ p_0).

Proof. Observe first that

    IP_0(ψ* = 1) + IP_1(ψ* = 0) = ∫_{ψ*=1} p_0 + ∫_{ψ*=0} p_1
                                = ∫_{p_1≥p_0} p_0 + ∫_{p_1<p_0} p_1
                                = ∫_{p_1≥p_0} min(p_0, p_1) + ∫_{p_1<p_0} min(p_0, p_1)
                                = ∫ min(p_0, p_1) .

Next, for any test ψ, define its rejection region R = {ψ = 1}. Let R* = {p_1 ≥ p_0} denote the rejection region of the likelihood ratio test ψ*. It holds

    IP_0(ψ = 1) + IP_1(ψ = 0) = 1 + IP_0(R) − IP_1(R)
                              = 1 + ∫_R (p_0 − p_1)
                              = 1 + ∫_{R∩R*} (p_0 − p_1) + ∫_{R∩(R*)^c} (p_0 − p_1)
                              = 1 − ∫_{R∩R*} |p_0 − p_1| + ∫_{R∩(R*)^c} |p_0 − p_1|
                              = 1 + ∫ |p_0 − p_1| [ 1I(R ∩ (R*)^c) − 1I(R ∩ R*) ] .

The above quantity is clearly minimized for R = R*.
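The equality case of the lemma is easy to probe numerically. The following Python sketch (with arbitrary parameter values, not part of the original argument) takes IP_0 = N(0, 1) and IP_1 = N(1, 1): the likelihood ratio test rejects when p_1 ≥ p_0, i.e., when z ≥ 1/2, and its total error matches the overlap ∫ min(p_0, p_1) computed by a simple Riemann sum.

    import numpy as np
    from scipy.stats import norm

    # Neyman-Pearson check for IP0 = N(0,1) vs IP1 = N(1,1): the likelihood
    # ratio test rejects when z >= 1/2, and its sum of type I and type II
    # errors equals the integral of min(p0, p1) (here 2*Phi(-1/2) ~ 0.617).
    p0, p1 = norm(loc=0.0, scale=1.0), norm(loc=1.0, scale=1.0)
    type_I = 1 - p0.cdf(0.5)      # IP0(psi* = 1)
    type_II = p1.cdf(0.5)         # IP1(psi* = 0)

    z = np.linspace(-10.0, 10.0, 200_001)
    dz = z[1] - z[0]
    overlap = np.sum(np.minimum(p0.pdf(z), p1.pdf(z))) * dz
    print(type_I + type_II, overlap)   # the two numbers should agree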


The lower bound in the Neyman-Pearson lemma is related to a well-known quantity: the total variation distance.

Definition-Proposition 4.4. The total variation distance between two probability measures IP_0 and IP_1 on a measurable space (X, A) is defined by

    TV(IP_0, IP_1) = sup_{R∈A} |IP_0(R) − IP_1(R)|                        (i)
                   = sup_{R∈A} | ∫_R (p_0 − p_1) |                        (ii)
                   = (1/2) ∫ |p_0 − p_1|                                  (iii)
                   = 1 − ∫ min(p_0, p_1)                                  (iv)
                   = 1 − inf_ψ [ IP_0(ψ = 1) + IP_1(ψ = 0) ]              (v)

where the infimum above is taken over all tests.

Proof. Clearly (i) = (ii) and the Neyman-Pearson Lemma gives (iv) = (v). Moreover, by identifying a test ψ with its rejection region, it is not hard to see that (i) = (v). Therefore it remains only to show that (iii) is equal to any of the other expressions. Hereafter, we show that (iii) = (iv). To that end, observe that

    ∫ |p_0 − p_1| = ∫_{p_1≥p_0} (p_1 − p_0) + ∫_{p_1<p_0} (p_0 − p_1)
                  = ∫_{p_1≥p_0} p_1 + ∫_{p_1<p_0} p_0 − ∫ min(p_0, p_1)
                  = 1 − ∫_{p_1<p_0} p_1 + 1 − ∫_{p_1≥p_0} p_0 − ∫ min(p_0, p_1)
                  = 2 − 2 ∫ min(p_0, p_1) .

In view of the Neyman-Pearson lemma, it is clear that if we want to prove large lower bounds, we need to find probability distributions that are close in total variation. Yet, this conflicts with constraint (4.7) and a tradeoff needs to be achieved. To that end, in the Gaussian sequence model, we need to be able to compute the total variation distance between N(θ_0, (σ²/n) I_d) and N(θ_1, (σ²/n) I_d). None of the expressions in Definition-Proposition 4.4 gives an easy way to do so. The Kullback-Leibler divergence is much more convenient.


The Kullback-Leibler divergence

Definition 4.5. The Kullback-Leibler divergence between probability measures IP_1 and IP_0 is given by

    KL(IP_1, IP_0) = ∫ log( dIP_1/dIP_0 ) dIP_1   if IP_1 ≪ IP_0 ,
                   = ∞                            otherwise .

It can be shown [Tsy09] that the integral is always well defined when IP_1 ≪ IP_0 (though it can be equal to ∞ even in this case). Unlike the total variation distance, the Kullback-Leibler divergence is not a distance. Actually, it is not even symmetric. Nevertheless, it enjoys properties that are very useful for our purposes.

Proposition 4.6. Let IP and Q be two probability measures. Then

1. KL(IP, Q) ≥ 0.

2. The function (IP, Q) ↦ KL(IP, Q) is convex.

3. If IP and Q are product measures, i.e.,

    IP = ⊗_{i=1}^n IP_i   and   Q = ⊗_{i=1}^n Q_i ,

then

    KL(IP, Q) = Σ_{i=1}^n KL(IP_i, Q_i) .

Proof. If IP is not absolutely continuous with respect to Q, then the result is trivial. Next, assume that IP ≪ Q and let X ∼ IP.

1. Observe that by Jensen's inequality,

    KL(IP, Q) = −IE log( (dQ/dIP)(X) ) ≥ − log IE( (dQ/dIP)(X) ) = − log(1) = 0 .

2. Consider the function f : (x, y) ↦ x log(x/y) and compute its Hessian:

    ∂²f/∂x² = 1/x ,   ∂²f/∂y² = x/y² ,   ∂²f/∂x∂y = −1/y .

We check that the matrix H = [ 1/x  −1/y ; −1/y  x/y² ] is positive semidefinite for all x, y > 0. To that end, compute its determinant and trace:

    det(H) = 1/y² − 1/y² = 0 ,   Tr(H) = 1/x + x/y² > 0 .

Therefore, H has one null eigenvalue and one positive eigenvalue and must therefore be positive semidefinite, so that the function f is convex on (0, ∞)². Since the KL divergence is a sum of such functions, it is also convex on the space of positive measures.

3. Note that if X = (X_1, . . . , X_n),

    KL(IP, Q) = IE log( (dIP/dQ)(X) )
              = Σ_{i=1}^n ∫ log( (dIP_i/dQ_i)(X_i) ) dIP_1(X_1) · · · dIP_n(X_n)
              = Σ_{i=1}^n ∫ log( (dIP_i/dQ_i)(X_i) ) dIP_i(X_i)
              = Σ_{i=1}^n KL(IP_i, Q_i) .

Point 3 in Proposition 4.6 is particularly useful in statistics, where observations typically consist of n independent random variables.

Example 4.7. For any θ ∈ IR^d, let P_θ denote the distribution of Y ∼ N(θ, σ² I_d). Then

    KL(P_θ, P_{θ′}) = Σ_{i=1}^d (θ_i − θ′_i)² / (2σ²) = |θ − θ′|_2² / (2σ²) .

The proof is left as an exercise (see Problem 4.1).
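The identity in Example 4.7 is also easy to verify numerically. The short Python sketch below (dimensions and parameter values are arbitrary) estimates the KL divergence by Monte Carlo averaging of the log-likelihood ratio and compares it to the closed form.

    import numpy as np

    # Monte Carlo check of Example 4.7: for P_theta = N(theta, sigma^2 I_d),
    # KL(P_theta, P_theta') = |theta - theta'|_2^2 / (2 sigma^2).
    rng = np.random.default_rng(0)
    d, sigma = 5, 2.0
    theta = rng.normal(size=d)
    theta_prime = rng.normal(size=d)

    Y = theta + sigma * rng.normal(size=(100_000, d))   # samples from P_theta
    log_ratio = (np.sum((Y - theta_prime) ** 2, axis=1)
                 - np.sum((Y - theta) ** 2, axis=1)) / (2 * sigma ** 2)
    kl_mc = log_ratio.mean()
    kl_formula = np.sum((theta - theta_prime) ** 2) / (2 * sigma ** 2)
    print(kl_mc, kl_formula)   # agree up to Monte Carlo error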

The Kullback-Leibler divergence is easier to manipulate than the total variation distance but only the latter is related to the minimax probability of error. Fortunately, these two quantities can be compared using Pinsker's inequality. We prove here a slightly weaker version of Pinsker's inequality that will be sufficient for our purpose. For a stronger statement, see [Tsy09], Lemma 2.5.

Lemma 4.8 (Pinsker's inequality). Let IP and Q be two probability measures such that IP ≪ Q. Then

    TV(IP, Q) ≤ √( KL(IP, Q) ) .


Proof. Note that

    KL(IP, Q) = ∫_{pq>0} p log(p/q)
              = −2 ∫_{pq>0} p log( √(q/p) )
              = −2 ∫_{pq>0} p log( [√(q/p) − 1] + 1 )
              ≥ −2 ∫_{pq>0} p [ √(q/p) − 1 ]            (by Jensen)
              = 2 − 2 ∫ √(pq) .

Next, note that

    ( ∫ √(pq) )² = ( ∫ √( max(p, q) min(p, q) ) )²
                 ≤ ∫ max(p, q) ∫ min(p, q)              (by Cauchy-Schwarz)
                 = [ 2 − ∫ min(p, q) ] ∫ min(p, q)
                 = ( 1 + TV(IP, Q) )( 1 − TV(IP, Q) )
                 = 1 − TV(IP, Q)² .

The two displays yield

    KL(IP, Q) ≥ 2 − 2 √( 1 − TV(IP, Q)² ) ≥ TV(IP, Q)² ,

where we used the fact that 0 ≤ TV(IP, Q) ≤ 1 and √(1 − x) ≤ 1 − x/2 for x ∈ [0, 1].
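The inequality is easy to probe numerically; the sketch below (support size and number of trials are arbitrary) draws random pairs of discrete distributions and checks that TV ≤ √KL on every draw.

    import numpy as np

    # Numerical probe of (the weak form of) Pinsker's inequality on random
    # discrete distributions: TV(P, Q) <= sqrt(KL(P, Q)).
    rng = np.random.default_rng(1)
    for _ in range(1000):
        p = rng.dirichlet(np.ones(10))
        q = rng.dirichlet(np.ones(10))
        tv = 0.5 * np.abs(p - q).sum()
        kl = np.sum(p * np.log(p / q))   # finite since all entries are positive
        assert tv <= np.sqrt(kl) + 1e-12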

Pinsker's inequality yields the following theorem for the GSM.

Theorem 4.9. Assume that Θ contains two hypotheses θ_0 and θ_1 such that |θ_0 − θ_1|_2² = 8α²σ²/n for some α ∈ (0, 1/2). Then

    inf_{θ̂} sup_{θ∈Θ} IP_θ( |θ̂ − θ|_2² ≥ 2α²σ²/n ) ≥ 1/2 − α .

Proof. Write for simplicity IP_j = IP_{θ_j}, j = 0, 1. Recall that it follows from the reduction to hypothesis testing that

    inf_{θ̂} sup_{θ∈Θ} IP_θ( |θ̂ − θ|_2² ≥ 2α²σ²/n )
        ≥ inf_ψ max_{j=0,1} IP_j( ψ ≠ j )
        ≥ (1/2) inf_ψ ( IP_0(ψ = 1) + IP_1(ψ = 0) )
        = (1/2) [ 1 − TV(IP_0, IP_1) ]                     (Definition-Proposition 4.4)
        ≥ (1/2) [ 1 − √( KL(IP_1, IP_0) ) ]                (Lemma 4.8)
        = (1/2) [ 1 − √( n|θ_1 − θ_0|_2² / (2σ²) ) ]       (Example 4.7)
        = (1/2) [ 1 − 2α ] .

Clearly the result of Theorem 4.9 matches the upper bound for Θ = IR^d only for d = 1. How about larger d? A quick inspection of our proof shows that our technique, in its present state, cannot yield better results. Indeed, there are only two known candidates for the choice of θ*. With this knowledge, one can obtain upper bounds that do not depend on d by simply projecting Y onto the linear span of θ_0, θ_1 and then solving the GSM in two dimensions. To obtain larger lower bounds, we need to use more than two hypotheses. In particular, in view of the above discussion, we need a set of hypotheses that spans a linear space of dimension proportional to d. In principle, we should need at least of the order of d hypotheses, but we will actually need many more.

4.4 LOWER BOUNDS BASED ON MANY HYPOTHESES

The reduction to hypothesis testing from Section 4.2 allows us to use more than two hypotheses. Specifically, we should find θ_1, . . . , θ_M such that

    inf_ψ max_{1≤j≤M} IP_{θ_j}[ ψ ≠ j ] ≥ C′ ,

for some positive constant C′. Unfortunately, the Neyman-Pearson Lemma no longer exists for more than two hypotheses. Nevertheless, it is possible to relate the minimax probability of error directly to the Kullback-Leibler divergence, without involving the total variation distance. This is possible using a well-known result from information theory called Fano's inequality.

Theorem 4.10 (Fano's inequality). Let P_1, . . . , P_M, M ≥ 2, be probability distributions such that P_j ≪ P_k for all j, k. Then

    inf_ψ max_{1≤j≤M} P_j[ ψ(X) ≠ j ] ≥ 1 − ( (1/M²) Σ_{j,k=1}^M KL(P_j, P_k) + log 2 ) / log M ,

where the infimum is taken over all tests with values in {1, . . . , M}.


Proof. Define

    p_j = P_j(ψ = j)   and   q_j = (1/M) Σ_{k=1}^M P_k(ψ = j) ,

so that

    p̄ = (1/M) Σ_{j=1}^M p_j ∈ (0, 1) ,   q̄ = (1/M) Σ_{j=1}^M q_j = 1/M .

Moreover, define the KL divergence between two Bernoulli distributions:

    kl(p, q) = p log(p/q) + (1 − p) log( (1 − p)/(1 − q) ) .

Note that p̄ log p̄ + (1 − p̄) log(1 − p̄) ≥ − log 2, which, since q̄ = 1/M, yields

    p̄ ≤ ( kl(p̄, q̄) + log 2 ) / log M .

Next, observe that by convexity (Proposition 4.6, point 2), it holds

    kl(p̄, q̄) ≤ (1/M) Σ_{j=1}^M kl(p_j, q_j) ≤ (1/M²) Σ_{j,k=1}^M kl( P_j(ψ = j), P_k(ψ = j) ) .

It remains to show that

    kl( P_j(ψ = j), P_k(ψ = j) ) ≤ KL(P_j, P_k) .

This result can be seen to follow directly from a well-known inequality often referred to as the data processing inequality, but we are going to prove it directly. Denote by dP_k^{=j} (resp. dP_k^{≠j}) the conditional density of P_k given ψ(X) = j (resp. ψ(X) ≠ j) and recall that

    KL(P_j, P_k) = ∫ log( dP_j/dP_k ) dP_j
                 = ∫_{ψ=j} log( dP_j/dP_k ) dP_j + ∫_{ψ≠j} log( dP_j/dP_k ) dP_j
                 = P_j(ψ = j) ∫ log( (dP_j^{=j}/dP_k^{=j}) · (P_j(ψ = j)/P_k(ψ = j)) ) dP_j^{=j}
                   + P_j(ψ ≠ j) ∫ log( (dP_j^{≠j}/dP_k^{≠j}) · (P_j(ψ ≠ j)/P_k(ψ ≠ j)) ) dP_j^{≠j}
                 = P_j(ψ = j) ( log( P_j(ψ = j)/P_k(ψ = j) ) + KL(P_j^{=j}, P_k^{=j}) )
                   + P_j(ψ ≠ j) ( log( P_j(ψ ≠ j)/P_k(ψ ≠ j) ) + KL(P_j^{≠j}, P_k^{≠j}) )
                 ≥ kl( P_j(ψ = j), P_k(ψ = j) ) ,

where we used Proposition 4.6, point 1. We have proved that

    (1/M) Σ_{j=1}^M P_j(ψ = j) ≤ ( (1/M²) Σ_{j,k=1}^M KL(P_j, P_k) + log 2 ) / log M .

Passing to the complementary events completes the proof of Fano's inequality.

Fano's inequality leads to the following useful theorem.

Theorem 4.11. Assume that Θ contains M ≥ 5 hypotheses θ_1, . . . , θ_M such that, for some constant 0 < α < 1/4, it holds

    (i)  |θ_j − θ_k|_2² ≥ 4φ ,
    (ii) |θ_j − θ_k|_2² ≤ (2ασ²/n) log(M) .

Then

    inf_{θ̂} sup_{θ∈Θ} IP_θ( |θ̂ − θ|_2² ≥ φ ) ≥ 1/2 − 2α .

Proof. In view of (i), it follows from the reduction to hypothesis testing that it is sufficient to prove that

    inf_ψ max_{1≤j≤M} IP_{θ_j}[ ψ ≠ j ] ≥ 1/2 − 2α .

It follows from (ii) and Example 4.7 that

    KL(IP_j, IP_k) = n|θ_j − θ_k|_2² / (2σ²) ≤ α log(M) .

Moreover, since M ≥ 5,

    ( (1/M²) Σ_{j,k=1}^M KL(IP_j, IP_k) + log 2 ) / log M ≤ ( α log(M) + log 2 ) / log M ≤ 2α + 1/2 .

The proof then follows from Fano's inequality.

Theorem 4.11 indicates that we must take φ ≤ (ασ²/(2n)) log(M). Therefore, the larger the M, the larger the lower bound can be. However, M cannot be arbitrarily large because of the constraint (i). We are therefore facing a packing problem where the goal is to "pack" as many Euclidean balls of radius proportional to σ√(log(M)/n) in Θ under the constraint that their centers remain close together (constraint (ii)). If Θ = IR^d, the goal is to pack the Euclidean ball of radius R = σ√(2α log(M)/n) with Euclidean balls of radius R√(2α)/γ. This can be done using a volume argument (see Problem 4.3). However, we will use the more versatile lemma below. It gives a lower bound on the size of a packing of the discrete hypercube {0, 1}^d with respect to the Hamming distance, defined by

    ρ(ω, ω′) = Σ_{i=1}^d 1I(ω_i ≠ ω′_i) ,   ∀ ω, ω′ ∈ {0, 1}^d .

Lemma 4.12 (Varshamov-Gilbert). For any γ ∈ (0, 1/2), there exist binary vectors ω_1, . . . , ω_M ∈ {0, 1}^d such that

    (i)  ρ(ω_j, ω_k) ≥ (1/2 − γ) d   for all j ≠ k ,
    (ii) M = ⌊e^{γ²d}⌋ ≥ e^{γ²d/2} .

Proof. Let ω_{j,i}, 1 ≤ i ≤ d, 1 ≤ j ≤ M, be i.i.d. Bernoulli random variables with parameter 1/2 and observe that, for j ≠ k,

    d − ρ(ω_j, ω_k) = X ∼ Bin(d, 1/2) .

Therefore it follows from a union bound that

    IP[ ∃ j ≠ k , ρ(ω_j, ω_k) < (1/2 − γ) d ] ≤ ( M(M − 1)/2 ) IP( X − d/2 > γd ) .

Hoeffding's inequality then yields

    ( M(M − 1)/2 ) IP( X − d/2 > γd ) ≤ exp( −2γ²d + log( M(M − 1)/2 ) ) < 1

as soon as

    M(M − 1) < 2 exp( 2γ²d ) .

A sufficient condition for the above inequality to hold is to take M = ⌊e^{γ²d}⌋ ≥ e^{γ²d/2}. For this value of M, we have

    IP( ∀ j ≠ k , ρ(ω_j, ω_k) ≥ (1/2 − γ) d ) > 0

and by virtue of the probabilistic method, there exist ω_1, . . . , ω_M ∈ {0, 1}^d that satisfy (i) and (ii).
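The probabilistic argument is easy to mimic in simulation; the following sketch (with small, arbitrary values of d and γ) draws M = ⌊exp(γ²d)⌋ random sign patterns and checks the pairwise Hamming distances required by the lemma.

    import numpy as np
    from itertools import combinations

    # Varshamov-Gilbert by the probabilistic method: draw M = floor(exp(gamma^2 d))
    # i.i.d. Bernoulli(1/2) vectors and check that all pairwise Hamming distances
    # are at least (1/2 - gamma) d. A failed draw can simply be redrawn; the lemma
    # guarantees that success happens with positive probability.
    rng = np.random.default_rng(2)
    d, gamma = 100, 0.2
    M = int(np.floor(np.exp(gamma ** 2 * d)))
    omega = rng.integers(0, 2, size=(M, d))
    min_dist = min(np.sum(omega[j] != omega[k]) for j, k in combinations(range(M), 2))
    print(M, min_dist, (0.5 - gamma) * d)   # typically min_dist >= (1/2 - gamma) d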

4.5 APPLICATION TO THE GAUSSIAN SEQUENCE MODEL

We are now in a position to apply Theorem 4.11 by choosing θ_1, . . . , θ_M based on ω_1, . . . , ω_M from the Varshamov-Gilbert Lemma.


Lower bounds for estimation

Take γ = 1/4 and apply the Varshamov-Gilbert Lemma to obtain ω_1, . . . , ω_M with M = ⌊e^{d/16}⌋ ≥ e^{d/32} and such that ρ(ω_j, ω_k) ≥ d/4 for all j ≠ k. Let θ_1, . . . , θ_M be such that

    θ_j = ω_j (βσ/√n) ,

for some β > 0 to be chosen later. We can check the conditions of Theorem 4.11:

    (i)  |θ_j − θ_k|_2² = (β²σ²/n) ρ(ω_j, ω_k) ≥ 4 · (β²σ²d)/(16n) ;
    (ii) |θ_j − θ_k|_2² = (β²σ²/n) ρ(ω_j, ω_k) ≤ β²σ²d/n ≤ (32β²σ²/n) log(M) = (2ασ²/n) log(M) ,

for β = √α/4. Applying now Theorem 4.11 yields

    inf_{θ̂} sup_{θ∈IR^d} IP_θ( |θ̂ − θ|_2² ≥ (α/256) σ²d/n ) ≥ 1/2 − 2α .

It implies the following corollary.

Corollary 4.13. The minimax rate of estimation over IR^d in the Gaussian sequence model is φ(IR^d) = σ²d/n. Moreover, it is attained by the least squares estimator θ̂^ls = Y.

Note that this rate is also minimax over sets Θ that are strictly smaller than IR^d (see Problem 4.4). Indeed, it is minimax over any subset of IR^d that contains θ_1, . . . , θ_M.
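The upper bound side of this statement is easy to see in simulation. The following sketch (dimensions and noise level are arbitrary) checks by Monte Carlo that the risk of the least squares estimator θ̂^ls = Y in the GSM concentrates around σ²d/n.

    import numpy as np

    # Monte Carlo illustration in the GSM Y = theta* + eps, eps ~ N(0, (sigma^2/n) I_d):
    # the risk of the least squares estimator theta_hat = Y is close to sigma^2 d / n.
    rng = np.random.default_rng(3)
    d, n, sigma = 50, 200, 1.0
    theta_star = rng.normal(size=d)
    reps = 20_000
    Y = theta_star + (sigma / np.sqrt(n)) * rng.normal(size=(reps, d))
    risk = np.mean(np.sum((Y - theta_star) ** 2, axis=1))
    print(risk, sigma ** 2 * d / n)   # both close to 0.25 with these values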

Lower bounds for sparse estimation

It appears from Table 4.1 that when estimating sparse vectors, we have to pay an extra logarithmic term log(ed/k) for not knowing the sparsity pattern of the unknown θ*. In this section, we show that this term is unavoidable as it appears in the minimax optimal rate of estimation of sparse vectors.

Note that the vectors θ_1, . . . , θ_M employed in the previous subsection are not guaranteed to be sparse, because the vectors ω_1, . . . , ω_M obtained from the Varshamov-Gilbert Lemma may themselves not be sparse. To overcome this limitation, we need a sparse version of the Varshamov-Gilbert lemma.

Lemma 4.14 (Sparse Varshamov-Gilbert). There exist positive constants C_1 and C_2 such that the following holds for any two integers k and d such that 1 ≤ k ≤ d/8. There exist binary vectors ω_1, . . . , ω_M ∈ {0, 1}^d such that

    (i)   ρ(ω_i, ω_j) ≥ k/2   for all i ≠ j ,
    (ii)  log(M) ≥ (k/8) log( 1 + d/(2k) ) ,
    (iii) |ω_j|_0 = k   for all j .

Proof. Take ω_1, . . . , ω_M independently and uniformly at random from the set

    C_0(k) = { ω ∈ {0, 1}^d : |ω|_0 = k }

of k-sparse binary vectors. Note that C_0(k) has cardinality (d choose k). To choose ω_j uniformly from C_0(k), we proceed as follows. Let U_1, . . . , U_k ∈ {1, . . . , d} be k random variables such that U_1 is drawn uniformly at random from {1, . . . , d} and, for any i = 2, . . . , k, conditionally on U_1, . . . , U_{i−1}, the random variable U_i is drawn uniformly at random from {1, . . . , d} \ {U_1, . . . , U_{i−1}}. Then define

    ω_i = 1 if i ∈ {U_1, . . . , U_k} ,   ω_i = 0 otherwise .

Clearly, all outcomes in C_0(k) are equally likely under this distribution and therefore ω is uniformly distributed on C_0(k). Observe that

    IP( ∃ ω_j ≠ ω_k : ρ(ω_j, ω_k) < k/2 )
        = (1/(d choose k)) Σ_{x∈{0,1}^d, |x|_0=k} IP( ∃ ω_j ≠ x : ρ(ω_j, x) < k/2 )
        ≤ (1/(d choose k)) Σ_{x∈{0,1}^d, |x|_0=k} Σ_{j=1}^M IP( ω_j ≠ x : ρ(ω_j, x) < k/2 )
        = M IP( ω ≠ x_0 : ρ(ω, x_0) < k/2 ) ,

where ω has the same distribution as ω_1 and x_0 is any fixed k-sparse vector in {0, 1}^d. The last equality holds by symmetry since (i) all the ω_j's have the same distribution and (ii) all the outcomes of ω_j are equally likely.

Note that

    ρ(ω, x_0) ≥ k − Σ_{i=1}^k Z_i ,

where Z_i = 1I( U_i ∈ supp(x_0) ). Indeed the left-hand side is the number of coordinates on which the vectors ω and x_0 disagree and the right-hand side is the number of coordinates in supp(x_0) on which the two vectors disagree. In particular, we have that Z_1 ∼ Ber(k/d) and, for any i = 2, . . . , k, conditionally on Z_1, . . . , Z_{i−1}, we have Z_i ∼ Ber(Q_i), where

    Q_i = ( k − Σ_{l=1}^{i−1} Z_l ) / ( d − (i − 1) ) ≤ k/(d − k) ≤ 2k/d ,

since k ≤ d/2.

Next we apply a Chernoff bound to get that for any s > 0,

    IP( ω ≠ x_0 : ρ(ω, x_0) < k/2 ) ≤ IP( Σ_{i=1}^k Z_i > k/2 ) ≤ IE[ exp( s Σ_{i=1}^k Z_i ) ] e^{−sk/2} .


The above MGF can be controlled by induction on k as follows:

    IE[ exp( s Σ_{i=1}^k Z_i ) ]
        = IE[ exp( s Σ_{i=1}^{k−1} Z_i ) IE[ exp(sZ_k) | Z_1, . . . , Z_{k−1} ] ]
        = IE[ exp( s Σ_{i=1}^{k−1} Z_i ) ( Q_k(e^s − 1) + 1 ) ]
        ≤ IE[ exp( s Σ_{i=1}^{k−1} Z_i ) ] ( (2k/d)(e^s − 1) + 1 )
        ...
        ≤ ( (2k/d)(e^s − 1) + 1 )^k = 2^k

for s = log( 1 + d/(2k) ). Putting everything together, we get

    IP( ∃ ω_j ≠ ω_k : ρ(ω_j, ω_k) < k/2 )
        ≤ exp( log M + k log 2 − sk/2 )
        = exp( log M + k log 2 − (k/2) log( 1 + d/(2k) ) )
        ≤ exp( log M − (k/4) log( 1 + d/(2k) ) )          (for d ≥ 8k)
        < 1 ,

if we take M such that

    log M < (k/4) log( 1 + d/(2k) ) .

Apply the sparse Varshamov-Gilbert lemma to obtain ω_1, . . . , ω_M with log(M) ≥ (k/8) log(1 + d/(2k)) and such that ρ(ω_j, ω_k) ≥ k/2 for all j ≠ k. Let θ_1, . . . , θ_M be such that

    θ_j = ω_j (βσ/√n) √( log( 1 + d/(2k) ) ) ,

for some β > 0 to be chosen later. We can check the conditions of Theorem 4.11:

    (i)  |θ_j − θ_k|_2² = (β²σ²/n) ρ(ω_j, ω_k) log(1 + d/(2k)) ≥ 4 · (β²σ²/(8n)) k log(1 + d/(2k)) ;
    (ii) |θ_j − θ_k|_2² = (β²σ²/n) ρ(ω_j, ω_k) log(1 + d/(2k)) ≤ (2kβ²σ²/n) log(1 + d/(2k)) ≤ (2ασ²/n) log(M) ,

for β = √(α/8). Applying now Theorem 4.11 yields

    inf_{θ̂} sup_{θ∈IR^d, |θ|_0≤k} IP_θ( |θ̂ − θ|_2² ≥ (ασ²/(64n)) k log( 1 + d/(2k) ) ) ≥ 1/2 − 2α .

It implies the following corollary.

Corollary 4.15. Recall that B_0(k) ⊂ IR^d denotes the set of all k-sparse vectors of IR^d. The minimax rate of estimation over B_0(k) in the Gaussian sequence model is φ(B_0(k)) = (σ²k/n) log(ed/k). Moreover, it is attained by the constrained least squares estimator θ̂^ls_{B_0(k)}.

Note that the modified BIC estimator of Problem 2.7 and the Slope (2.26) are also minimax optimal over B_0(k). However, unlike θ̂^ls_{B_0(k)} and the modified BIC, the Slope is also adaptive to k. For any ε > 0, the Lasso estimator and the BIC estimator are minimax optimal for sets of parameters such that k ≤ d^{1−ε}.

Lower bounds for estimating vectors in `1 balls

Recall that in Maurey's argument, we approximated a vector θ such that |θ|_1 = R by a vector θ′ such that |θ′|_0 = (R/σ) √( n / log d ). We can essentially do the same for the lower bound.

Assume that d ≥ √n and let β ∈ (0, 1) be a parameter to be chosen later, and define k to be the smallest integer such that

    k ≥ (R/(βσ)) √( n / log(ed/√n) ) .

Let ω_1, . . . , ω_M be obtained from the sparse Varshamov-Gilbert Lemma 4.14 with this choice of k and define

    θ_j = ω_j R/k .

Observe that |θ_j|_1 = R for j = 1, . . . , M. We can check the conditions of Theorem 4.11:

    (i)  |θ_j − θ_k|_2² = (R²/k²) ρ(ω_j, ω_k) ≥ R²/(2k) ≥ 4 R min( R/8 , β²σ √( log(ed/√n) / (8n) ) ) ;
    (ii) |θ_j − θ_k|_2² ≤ 2R²/k ≤ 4Rβσ √( log(ed/√n) / n ) ≤ (2ασ²/n) log(M) ,

for β small enough if d ≥ Ck for some constant C > 0 chosen large enough. Applying now Theorem 4.11 yields

    inf_{θ̂} sup_{θ∈B_1(R)} IP_θ( |θ̂ − θ|_2² ≥ R min( R/8 , β²σ √( log(ed/√n) / (8n) ) ) ) ≥ 1/2 − 2α .

It implies the following corollary.


Corollary 4.16. Recall that B_1(R) ⊂ IR^d denotes the set of vectors θ ∈ IR^d such that |θ|_1 ≤ R. Then there exists a constant C > 0 such that if d ≥ n^{1/2+ε}, ε > 0, the minimax rate of estimation over B_1(R) in the Gaussian sequence model is

    φ(B_1(R)) = min( R² , Rσ √( log d / n ) ) .

Moreover, it is attained by the constrained least squares estimator θ̂^ls_{B_1(R)} if R ≥ σ√(log d / n) and by the trivial estimator θ̂ = 0 otherwise.

Proof. To complete the proof of the statement, we need to study the risk of the trivial estimator equal to zero for small R. Note that if |θ*|_1 ≤ R, we have

    |0 − θ*|_2² = |θ*|_2² ≤ |θ*|_1² = R² .

Remark 4.17. Note that the inequality |θ*|_2² ≤ |θ*|_1² appears to be quite loose. Nevertheless, it is tight up to a multiplicative constant for the vectors of the form θ_j = ω_j R/k that are employed in the lower bound. Indeed, if R ≤ σ√(log d / n), we have k ≤ 2/β and

    |θ_j|_2² = R²/k ≥ (β/2) |θ_j|_1² .

4.6 LOWER BOUNDS FOR SPARSE ESTIMATION VIA χ2 DIVERGENCE

In this section, we will show how to derive lower bounds by directly controlling the distance between a simple and a composite hypothesis, which is useful when investigating lower rates for decision problems instead of estimation problems.

Define the χ² divergence between two probability distributions IP, Q as

    χ²(IP, Q) = ∫ ( dIP/dQ − 1 )² dQ   if IP ≪ Q ,
              = ∞                      otherwise.

Comparing the expressions in the case where IP ≪ Q, we see that both the KL divergence and the χ² divergence can be written as ∫ f( dIP/dQ ) dQ with f(x) = x log x and f(x) = (x − 1)², respectively. Also note that both of these functions are convex in x and fulfill f(1) = 0, which by Jensen's inequality shows that they are non-negative and equal to 0 if and only if IP = Q.

Firstly, let us derive another useful expression for the χ² divergence. By expanding the square, we have

    χ²(IP, Q) = ∫ (dIP)²/dQ − 2 ∫ dIP + ∫ dQ = ∫ ( dIP/dQ )² dQ − 1 .

Secondly, we note that we can bound the KL divergence from above by the χ² divergence, via Jensen's inequality, as

    KL(IP, Q) = ∫ log( dIP/dQ ) dIP ≤ log ∫ (dIP)²/dQ = log( 1 + χ²(IP, Q) ) ≤ χ²(IP, Q) ,     (4.8)


where we used the concavity inequality log(1 + x) ≤ x for x > −1.

Note that this estimate is a priori very rough and that we are transitioning from something at log scale to something on a regular scale. However, since we are mostly interested in showing that these distances are bounded by a small constant after adjusting the parameters of our models appropriately, it will suffice for our purposes. In particular, we can combine (4.8) with the Neyman-Pearson Lemma 4.3 and Pinsker's inequality, Lemma 4.8, to get

    inf_ψ { IP_0(ψ = 1) + IP_1(ψ = 0) } ≥ 1 − √ε ,   so that   inf_ψ max{ IP_0(ψ = 1), IP_1(ψ = 0) } ≥ (1/2)(1 − √ε) ,

for distinguishing between two hypotheses with a decision rule ψ if χ²(IP_0, IP_1) ≤ ε.
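The chain of inequalities (4.8) can be checked numerically; the sketch below (support size and number of trials are arbitrary) verifies it on randomly generated discrete distributions.

    import numpy as np

    # Numerical check of KL(P,Q) <= log(1 + chi^2(P,Q)) <= chi^2(P,Q), cf. (4.8),
    # on random discrete distributions.
    rng = np.random.default_rng(4)
    for _ in range(1000):
        p = rng.dirichlet(np.ones(8))
        q = rng.dirichlet(np.ones(8))
        kl = np.sum(p * np.log(p / q))
        chi2 = np.sum(p ** 2 / q) - 1.0
        assert kl <= np.log1p(chi2) + 1e-12 <= chi2 + 1e-12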

In the following, we want to apply this observation to derive lower rates for distinguishing between

    H_0 : Y ∼ IP_0 ,    H_1 : Y ∼ IP_u , for some u ∈ B_0(k) ,

where IP_u denotes the probability distribution of a Gaussian sequence model Y = u + (σ/√n) ξ, with ξ ∼ N(0, I_d).

Theorem 4.18. Consider the detection problem

    H_0 : θ* = 0 ,    H_v : θ* = µv , for some v ∈ B_0(k), ‖v‖_2 = 1 ,

with data given by Y = θ* + (σ/√n) ξ, ξ ∼ N(0, I_d). There exist ε > 0 and c_ε > 0 such that if

    µ ≤ (σ/2) √(k/n) √( log( 1 + d/k² ) ) ,

then

    inf_ψ { IP_0(ψ = 1) ∨ max_{v∈B_0(k), ‖v‖_2=1} IP_v(ψ = 0) } ≥ (1/2)(1 − √ε) .

Proof. We introduce the mixture hypothesis

    IP = (1/(d choose k)) Σ_{S⊆[d], |S|=k} IP_S ,

with IP_S being the distribution of Y = µ 1I_S/√k + (σ/√n) ξ, and give lower bounds on distinguishing

    H_0 : Y ∼ IP_0 ,    H_1 : Y ∼ IP ,

by computing their χ² distance,

    χ²(IP, IP_0) = ∫ ( dIP/dIP_0 )² dIP_0 − 1
                 = (1/(d choose k)²) Σ_{S,T} ∫ ( dIP_S/dIP_0 · dIP_T/dIP_0 ) dIP_0 − 1 .


The first step is to compute dIP_S/dIP_0. Writing out the corresponding Gaussian densities yields

    (dIP_S/dIP_0)(X) = exp( −(n/(2σ²)) ‖X − µ1I_S/√k‖_2² ) / exp( −(n/(2σ²)) ‖X‖_2² )
                     = exp( (n/(2σ²)) ( (2µ/√k) ⟨X, 1I_S⟩ − µ² ) ) .

For convenience, we introduce the notation X = (σ/√n) Z for a standard normal Z ∼ N(0, I_d), set ν = µ√n/(σ√k), and write Z_S = Σ_{i∈S} Z_i for the sum of the coordinates of Z in the set S. Multiplying two of these densities and integrating with respect to IP_0 then reduces to computing the MGF of a Gaussian and gives

    IE_{X∼IP_0}[ exp( (n/(2σ²)) ( (2µ/√k)(σ/√n)(Z_S + Z_T) − 2µ² ) ) ] = IE_{Z∼N(0,I_d)} exp( ν(Z_S + Z_T) − ν²k ) .

Decomposing Z_S + Z_T = 2 Σ_{i∈S∩T} Z_i + Σ_{i∈S∆T} Z_i and noting that |S∆T| ≤ 2k, we see that

    IE_{IP_0}( dIP_S/dIP_0 · dIP_T/dIP_0 ) = exp( (4/2) ν²|S ∩ T| + (ν²/2)|S∆T| − ν²k ) ≤ exp( 2ν²|S ∩ T| ) .

Now, we need to take the expectation over two uniform draws of support sets S and T, which reduces, via conditioning and exploiting the independence of the two draws, to

    χ²(IP, IP_0) ≤ IE_{S,T}[ exp( 2ν²|S ∩ T| ) ] − 1
                 = IE_T IE_S[ exp( 2ν²|S ∩ T| ) | T ] − 1
                 = IE_S[ exp( 2ν²|S ∩ [k]| ) ] − 1 .

Similar to the proof of the sparse Varshamov-Gilbert bound, Lemma 4.14, the distribution of |S ∩ [k]| is stochastically dominated by a binomial distribution Bin(k, k/d), so that

    χ²(IP, IP_0) ≤ IE[ exp( 2ν² Bin(k, k/d) ) ] − 1
                 = ( IE[ exp( 2ν² Ber(k/d) ) ] )^k − 1
                 = ( e^{2ν²} (k/d) + (1 − k/d) )^k − 1
                 = ( 1 + (k/d)( e^{2ν²} − 1 ) )^k − 1 .


Since (1 + x)^k − 1 ≈ kx for x = o(1/k), we can take ν² = (1/2) log( 1 + εd/k² ) to get a bound of the order ε. Plugging this back into the definition of ν yields

    µ = (σ/2) √(k/n) √( log( 1 + d/k² ) ) .

The rate for detection for arbitrary v now follows from

    IP_0(ψ = 1) ∨ max_{v∈B_0(k), ‖v‖_2=1} IP_v(ψ = 0) ≥ IP_0(ψ = 1) ∨ IP(ψ = 0) ≥ (1/2)(1 − √ε) .

Remark 4.19. If instead of the χ² divergence we try to compare the KL divergence between IP_0 and the mixture IP, we are tempted to exploit its convexity properties to get

    KL(IP, IP_0) ≤ (1/(d choose k)) Σ_S KL(IP_S, IP_0) = KL(IP_{[k]}, IP_0) ,

which is just the distance between two Gaussian distributions. In turn, we do not see any increase in complexity brought about by increasing the number of competing hypotheses, as we did in Section 4.5. By using the convexity estimate, we have lost the granularity of controlling how much the different probability distributions overlap. This is where the χ² divergence helped.

Remark 4.20. This rate for detection implies lower rates for estimation of both θ* and ‖θ*‖_2: if we were given an estimator T̂(X) for T(θ) = ‖θ‖_2 that achieves an error less than µ/2 for some X, then we would get a correct decision rule for those X by setting

    ψ(X) = 0  if T̂(X) < µ/2 ,    ψ(X) = 1  if T̂(X) ≥ µ/2 .

Conversely, the lower bound above means that such an error rate cannot be achieved with vanishing error probability for µ smaller than the critical value. Moreover, by the triangle inequality, we get the same lower bound also for estimation of θ* itself.

We note that the rate that we obtained is slightly worse in the scaling of the log factor (d/k² instead of d/k) than the one obtained in Corollary 4.15. This is because it applies to estimating the ℓ_2 norm as well, which is a strictly easier problem and for which estimators can be constructed that achieve the faster rate of (k/n) log(d/k²).

4.7 LOWER BOUNDS FOR ESTIMATING THE `1 NORM VIA MOMENTMATCHING

So far, we have seen many examples of rates that scale with n^{−1} for the squared error. In this section, we are going to see an example of a functional that can only be estimated at a much slower rate of 1/log n, following [CL11].

Consider the normalized ℓ_1 norm,

    T(θ) = (1/n) Σ_{i=1}^n |θ_i| ,                                        (4.9)

as the target functional to be estimated, and measurements

    Y ∼ N(θ, I_n) .                                                       (4.10)

We are going to show lower bounds for the estimation both when θ is restricted to an ℓ_∞ ball, Θ_n(M) = { θ ∈ IR^n : |θ_i| ≤ M }, and for a general θ ∈ IR^n.

A constrained risk inequality

We start with a general principle that is in the same spirit as the Neyman-Pearson lemma (Lemma 4.3), but deals with expectations rather than probabilities and uses the χ² distance to measure the closeness of distributions. Moreover, contrary to what we have seen so far, it allows us to compare two mixture distributions over candidate hypotheses, also known as priors.

In the following, we write X for a random observation coming from a distribution indexed by θ ∈ Θ, T̂ = T̂(X) for an estimator of a functional T(θ), and B(θ) = IE_θ T̂ − T(θ) for the bias of T̂.

Let µ_0 and µ_1 be two prior distributions over Θ and denote by m_i, v_i² the mean and variance of T(θ) under the prior µ_i, i ∈ {0, 1}:

    m_i = ∫ T(θ) dµ_i(θ) ,    v_i² = ∫ ( T(θ) − m_i )² dµ_i(θ) .

Moreover, write F_i for the marginal distributions over the priors and denote their densities with respect to the average of the two probability distributions by f_i. With this, we write the χ² distance between f_0 and f_1 as

    I² = χ²(F_1, F_0) = IE_{f_0}( f_1(X)/f_0(X) − 1 )² = IE_{f_0}( f_1(X)/f_0(X) )² − 1 .

Theorem 4.21. (i) If ∫ IE_θ( T̂(X) − T(θ) )² dµ_0(θ) ≤ ε², then

    | ∫ B(θ) dµ_1(θ) − ∫ B(θ) dµ_0(θ) | ≥ |m_1 − m_0| − (ε + v_0) I .                       (4.11)

(ii) If |m_1 − m_0| > v_0 I and 0 ≤ λ ≤ 1,

    ∫ IE_θ( T̂(X) − T(θ) )² d(λµ_0 + (1 − λ)µ_1)(θ) ≥ λ(1 − λ) ( |m_1 − m_0| − v_0 I )² / ( λ + (1 − λ)(I + 1)² ) ,   (4.12)

in particular

    max_{i∈{0,1}} ∫ IE_θ( T̂(X) − T(θ) )² dµ_i(θ) ≥ ( |m_1 − m_0| − v_0 I )² / (I + 2)² ,     (4.13)

and

    sup_{θ∈Θ} IE_θ( T̂(X) − T(θ) )² ≥ ( |m_1 − m_0| − v_0 I )² / (I + 2)² .                  (4.14)

Proof. Without loss of generality, assume m_1 ≥ m_0. We start by considering the term

    IE_{f_0}[ ( T̂(X) − m_0 ) ( (f_1(X) − f_0(X)) / f_0(X) ) ]
        = IE[ T̂(X) ( (f_1(X) − f_0(X)) / f_0(X) ) ]
        = IE_{f_1}[ m_1 + T̂(X) − m_1 ] − IE_{f_0}[ m_0 + T̂(X) − m_0 ]
        = m_1 + ∫ B(θ) dµ_1(θ) − ( m_0 + ∫ B(θ) dµ_0(θ) ) .

Moreover,

    IE_{f_0}( T̂(X) − m_0 )² = ∫ IE_θ( T̂(X) − m_0 )² dµ_0(θ)
        = ∫ IE_θ( T̂(X) − T(θ) + T(θ) − m_0 )² dµ_0(θ)
        = ∫ IE_θ( T̂(X) − T(θ) )² dµ_0(θ) + 2 ∫ B(θ)( T(θ) − m_0 ) dµ_0(θ) + ∫ ( T(θ) − m_0 )² dµ_0(θ)
        ≤ ε² + 2εv_0 + v_0² = (ε + v_0)² .

Therefore, by Cauchy-Schwarz,

    IE_{f_0}[ ( T̂(X) − m_0 ) ( (f_1(X) − f_0(X)) / f_0(X) ) ] ≤ (ε + v_0) I ,

and hence

    m_1 + ∫ B(θ) dµ_1(θ) − ( m_0 + ∫ B(θ) dµ_0(θ) ) ≤ (ε + v_0) I ,

whence

    ∫ B(θ) dµ_0(θ) − ∫ B(θ) dµ_1(θ) ≥ m_1 − m_0 − (ε + v_0) I ,


which gives us (4.11).

To estimate the risk under a mixture of µ_0 and µ_1, we note that (4.11) provides an estimate for the risk under µ_1 given an upper bound on the risk under µ_0, which means we can reduce this problem to bounding a quadratic from below. Consider

    J(x) = λx² + (1 − λ)(a − bx)² ,

with 0 < λ < 1, a > 0 and b > 0. J is minimized at x = x_min = ab(1 − λ)/(λ + b²(1 − λ)), with a − b x_min > 0 and J(x_min) = a²λ(1 − λ)/(λ + b²(1 − λ)). Hence, if we add a cut-off at zero in the second term, we do not change the minimal value, and

    λx² + (1 − λ)( (a − bx)_+ )²

has the same minimum.

From (4.11) and Jensen's inequality, setting ε² = ∫ B(θ)² dµ_0(θ), we get

    ∫ B(θ)² dµ_1(θ) ≥ ( ∫ B(θ) dµ_1(θ) )² ≥ ( ( m_1 − m_0 − (ε + v_0)I − ε )_+ )² .

Combining this with the estimate for the quadratic above yields

    ∫ IE_θ( T̂(X) − T(θ) )² d(λµ_0 + (1 − λ)µ_1)(θ)
        ≥ λε² + (1 − λ)( ( m_1 − m_0 − (ε + v_0)I − ε )_+ )²
        ≥ λ(1 − λ)( |m_1 − m_0| − v_0 I )² / ( λ + (1 − λ)(I + 1)² ) ,

which is (4.12).

Finally, since the minimax risk is larger than any mixture risk, we can bound it from below by the maximum value of this bound, which is attained at λ = (I + 1)/(I + 2); this gives (4.13), and by the same argument (4.14).

Unfavorable priors via polynomial approximation

In order to construct unfavorable priors for the ℓ_1 norm (4.9), we resort to polynomial approximation theory. For a continuous function f on [−1, 1], denote the maximal approximation error by polynomials of degree k (whose set is written P_k) by

    δ_k(f) = inf_{G∈P_k} max_{x∈[−1,1]} |f(x) − G(x)| ,

and the polynomial attaining this error by G*_k. For f(t) = |t|, it is known [Riv90] that

    β* = lim_{k→∞} 2k δ_{2k}(f) ∈ (0, ∞) ,


which is a very slow rate of convergence of the approximation, compared to smooth functions for which we would expect geometric convergence.

We can now construct our priors using some abstract measure-theoretic results.

Lemma 4.22. For an integer k > 0, there are two probability measures ν_0 and ν_1 on [−1, 1] that fulfill the following:

1. ν_0 and ν_1 are symmetric about 0;

2. ∫ t^l dν_1(t) = ∫ t^l dν_0(t) for all l ∈ {0, 1, . . . , k};

3. ∫ |t| dν_1(t) − ∫ |t| dν_0(t) = 2δ_k .

Proof. The idea is to construct the measures via the Hahn-Banach theorem and the Riesz representation theorem.

First, consider f(t) = |t| as an element of the space C([−1, 1]) equipped with the supremum norm, define P_k as the space of polynomials of degree up to k, and set F_k := span(P_k ∪ {f}). On F_k, define the functional T(cf + p_k) = cδ_k, which is well defined because f is not a polynomial.

Claim: ‖T‖ = sup{ T(g) : g ∈ F_k, ‖g‖_∞ ≤ 1 } = 1.

‖T‖ ≥ 1: Let G*_k be the best-approximating polynomial to f in P_k. Then ‖f − G*_k‖_∞ = δ_k, ‖(f − G*_k)/δ_k‖_∞ = 1, and T( (f − G*_k)/δ_k ) = 1.

‖T‖ ≤ 1: Suppose g = cf + p_k with p_k ∈ P_k, ‖g‖_∞ = 1 and T(g) > 1. Then c > 1/δ_k and

    ‖f − (−p_k/c)‖_∞ = 1/c < δ_k ,

contradicting the definition of δ_k.

Now, by the Hahn-Banach theorem, there is a norm-preserving extension of T to C([−1, 1]), which we again denote by T. By the Riesz representation theorem, there is a signed Borel measure τ with total variation equal to 1 such that

    T(g) = ∫_{−1}^{1} g(t) dτ(t)   for all g ∈ C([−1, 1]) .

The Hahn-Jordan decomposition gives two positive measures τ_+ and τ_− such that τ = τ_+ − τ_− and

    ∫_{−1}^{1} |t| d(τ_+ − τ_−)(t) = δ_k   and   ∫_{−1}^{1} t^l dτ_+(t) = ∫_{−1}^{1} t^l dτ_−(t) ,   l ∈ {0, . . . , k} .

Since f is a symmetric function, we can symmetrize these measures and hence can assume that they are symmetric about 0. Finally, to get probability measures with the desired properties, set ν_1 = 2τ_+ and ν_0 = 2τ_−.


Minimax lower bounds

Theorem 4.23. We have the following bounds on the minimax rate for T(θ) = (1/n) Σ_{i=1}^n |θ_i|. For θ ∈ Θ_n(M) = { θ ∈ IR^n : |θ_i| ≤ M },

    inf_{T̂} sup_{θ∈Θ_n(M)} IE_θ( T̂(X) − T(θ) )² ≥ β*² M² ( log log n / log n )² (1 + o(1)) .      (4.15)

For θ ∈ IR^n,

    inf_{T̂} sup_{θ∈IR^n} IE_θ( T̂(X) − T(θ) )² ≥ β*² / (16 e² log n) (1 + o(1)) .                  (4.16)

The last remaining ingredient for the proof is the family of Hermite polynomials, a family of orthogonal polynomials with respect to the Gaussian density

    ϕ(y) = (1/√(2π)) e^{−y²/2} .

For our purposes, it is enough to define them by the derivatives of the density,

    d^k/dy^k ϕ(y) = (−1)^k H_k(y) ϕ(y) ,

and to observe that they are orthogonal:

    ∫ H_k(y)² ϕ(y) dy = k! ,    ∫ H_k(y) H_j(y) ϕ(y) dy = 0   for k ≠ j .
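These are the probabilists' Hermite polynomials, and their orthogonality relations are easy to check numerically, for instance with Gauss-Hermite quadrature as in the following sketch (the degrees tested are arbitrary).

    import numpy as np
    from numpy.polynomial import hermite_e as He
    from math import factorial, sqrt, pi

    # Check the orthogonality relations of the (probabilists') Hermite polynomials
    # H_k with respect to the standard Gaussian density, using Gauss-Hermite_e
    # quadrature whose weight function is exp(-y^2/2).
    x, w = He.hermegauss(60)                       # exact for degrees up to 119
    for k in range(6):
        for j in range(6):
            Hk = He.hermeval(x, [0] * k + [1])     # H_k at the quadrature nodes
            Hj = He.hermeval(x, [0] * j + [1])
            inner = np.sum(w * Hk * Hj) / sqrt(2 * pi)
            expected = factorial(k) if k == j else 0.0
            assert abs(inner - expected) < 1e-8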

Proof. We want to use Theorem 4.21. To construct the prior measures, we rescale the measures from Lemma 4.22 appropriately. Let k_n be an even integer that is to be determined, and ν_0, ν_1 the two measures given by Lemma 4.22. Define g(x) = Mx and define the measures µ_i by dilating ν_i: µ_i(A) = ν_i( g^{−1}(A) ) for every Borel set A ⊆ [−M, M] and i ∈ {0, 1}. Hence,

1. µ_0 and µ_1 are symmetric about 0;

2. ∫ t^l dµ_1(t) = ∫ t^l dµ_0(t) for all l ∈ {0, 1, . . . , k_n};

3. ∫ |t| dµ_1(t) − ∫ |t| dµ_0(t) = 2Mδ_{k_n} .

To get priors for n i.i.d. samples, consider the product priors µ_i^n = ∏_{j=1}^n µ_i. With this, we have

    IE_{µ_1^n} T(θ) − IE_{µ_0^n} T(θ) = IE_{µ_1}|θ_1| − IE_{µ_0}|θ_1| = 2Mδ_{k_n} ,

and

    IE_{µ_0^n}( T(θ) − IE_{µ_0^n} T(θ) )² ≤ M²/n ,

since each θ_i ∈ [−M, M].


It remains to control the χ² distance between the two marginal distributions. To this end, we will expand the Gaussian density in terms of Hermite polynomials. First, set f_i(y) = ∫ ϕ(y − t) dµ_i(t). Since x ↦ exp(−x) is convex and µ_0 is symmetric,

    f_0(y) ≥ (1/√(2π)) exp( − ∫ (y − t)²/2 dµ_0(t) ) = ϕ(y) exp( −(1/2) M² ∫ t² dν_0(t) ) ≥ ϕ(y) exp( −(1/2) M² ) .

Expand ϕ as

    ϕ(y − t) = Σ_{k=0}^∞ H_k(y) ϕ(y) t^k / k! .

Then,

    ∫ ( f_1(y) − f_0(y) )² / f_0(y) dy ≤ e^{M²/2} Σ_{k=k_n+1}^∞ (1/k!) M^{2k} .

The χ² distance for n i.i.d. observations can now be easily bounded by

    I_n² = ∫ ( ∏_{i=1}^n f_1(y_i) − ∏_{i=1}^n f_0(y_i) )² / ∏_{i=1}^n f_0(y_i) dy_1 · · · dy_n
         = ∫ ( ∏_{i=1}^n f_1(y_i) )² / ∏_{i=1}^n f_0(y_i) dy_1 · · · dy_n − 1
         = ( ∏_{i=1}^n ∫ f_1(y_i)² / f_0(y_i) dy_i ) − 1
         ≤ ( 1 + e^{M²/2} Σ_{k=k_n+1}^∞ (1/k!) M^{2k} )^n − 1
         ≤ ( 1 + e^{3M²/2} (1/k_n!) M^{2k_n} )^n − 1
         ≤ ( 1 + e^{3M²/2} ( eM²/k_n )^{k_n} )^n − 1 ,                                      (4.17)

where the last step used a Stirling-type estimate, k! > (k/e)^k.

Note that if x = o(1/n), then (1 + x)^n − 1 = nx(1 + o(1)) by Taylor's formula, so we can choose k_n ≥ log n/(log log n) to guarantee that, for n large enough, I_n → 0. With Theorem 4.21, we have

    inf_{T̂} sup_{θ∈Θ_n(M)} IE( T̂ − T(θ) )² ≥ ( 2Mδ_{k_n} − (M/√n) I_n )² / ( I_n + 2 )²
                                           = β*² M² ( log log n / log n )² (1 + o(1)) .


To prove the lower bound over IR^n, take M = √(log n) and k_n to be the smallest integer such that k_n ≥ 2e log n, and plug this into (4.17), yielding

    I_n² ≤ ( 1 + n^{3/2} ( e log n / (2e log n) )^{k_n} )^n − 1 ,

which goes to zero. Hence, we can conclude just as before.

Note that [CL11] also complement these lower bounds with upper bounds that are based on polynomial approximation, completing the picture of estimating T(θ).


4.8 PROBLEM SET

Problem 4.1. (a) Prove the statement of Example 4.7.

(b) Let P_θ denote the distribution of X ∼ Ber(θ). Show that

    KL(P_θ, P_{θ′}) ≥ C(θ − θ′)² .

Problem 4.2. Let IP_0 and IP_1 be two probability measures. Prove that for any test ψ, it holds

    max_{j=0,1} IP_j(ψ ≠ j) ≥ (1/4) e^{−KL(IP_0, IP_1)} .

Problem 4.3. For any R > 0, θ ∈ IR^d, denote by B_2(θ, R) the (Euclidean) ball of radius R centered at θ. For any ε > 0 let N = N(ε) be the largest integer such that there exist θ_1, . . . , θ_N ∈ B_2(0, 1) for which

    |θ_i − θ_j| ≥ 2ε

for all i ≠ j. We call the set {θ_1, . . . , θ_N} an ε-packing of B_2(0, 1).

(a) Show that there exists a constant C > 0 such that N ≤ C^d/ε^d.

(b) Show that for any x ∈ B_2(0, 1), there exists i = 1, . . . , N such that |x − θ_i|_2 ≤ 2ε.

(c) Use (b) to conclude that there exists a constant C′ > 0 such that N ≥ C′^d/ε^d.

Problem 4.4. Show that the rate φ = σ²d/n is the minimax rate of estimation over:

(a) The Euclidean ball of IR^d with radius √(σ²d/n).

(b) The unit ℓ_∞ ball of IR^d defined by

    B_∞(1) = { θ ∈ IR^d : max_j |θ_j| ≤ 1 } ,

as long as σ² ≤ n.

(c) The set of nonnegative vectors of IR^d.

(d) The discrete hypercube (σ/(16√n)) {0, 1}^d.

Problem 4.5. Fix β ≥ 5/3, Q > 0 and prove that the minimax rate of estimation over Θ(β, Q) with the ‖ · ‖_{L²([0,1])} norm is given by n^{−2β/(2β+1)}.

[Hint: Consider functions of the form

    f_j = (C/√n) Σ_{i=1}^N ω_{j,i} ϕ_i ,

where C is a constant, ω_j ∈ {0, 1}^N for some appropriately chosen N, and {ϕ_i}_{i≥1} is the trigonometric basis.]


Chapter 5

Matrix Estimation

Over the past decade or so, matrices have entered the picture of high-dimensional statistics for several reasons. Perhaps the simplest explanation is that they are the most natural extension of vectors. While this is true, and we will see examples where the extension from vectors to matrices is straightforward, matrices have a much richer structure than vectors, allowing "interaction" between their rows and columns. In particular, while we have been describing simple vectors in terms of their sparsity, here we can measure the complexity of a matrix by its rank. This feature was successfully employed in a variety of applications ranging from multi-task learning to collaborative filtering. This last application was made popular by the Netflix prize in particular.

In this chapter, we study several statistical problems where the parameter of interest θ is a matrix rather than a vector. These problems include multivariate regression, covariance matrix estimation and principal component analysis. Before getting to these topics, we begin with a quick reminder on matrices and linear algebra.

5.1 BASIC FACTS ABOUT MATRICES

Matrices are much more complicated objects than vectors. In particular, while vectors can be identified with linear operators from IR^d to IR, matrices can be identified with linear operators from IR^d to IR^n for n ≥ 1. This seemingly simple fact gives rise to a profusion of notions and properties, as illustrated by Bernstein's book [Ber09] that contains facts about matrices over more than a thousand pages. Fortunately, we will need only a small number of such properties, which can be found in the excellent book [GVL96], which has become a standard reference on matrices and numerical linear algebra.


Singular value decomposition

Let A = {a_{ij}, 1 ≤ i ≤ m, 1 ≤ j ≤ n} be an m × n real matrix of rank r ≤ min(m, n). The Singular Value Decomposition (SVD) of A is given by

    A = UDV^⊤ = Σ_{j=1}^r λ_j u_j v_j^⊤ ,

where D is an r × r diagonal matrix with positive diagonal entries {λ_1, . . . , λ_r}, U is a matrix with columns {u_1, . . . , u_r} ∈ IR^m that are orthonormal, and V is a matrix with columns {v_1, . . . , v_r} ∈ IR^n that are also orthonormal. Moreover, it holds that

    AA^⊤ u_j = λ_j² u_j   and   A^⊤A v_j = λ_j² v_j

for j = 1, . . . , r. The values λ_j > 0 are called singular values of A and are uniquely defined. If the rank r < min(n, m), then the singular values of A are given by λ = (λ_1, . . . , λ_r, 0, . . . , 0)^⊤ ∈ IR^{min(n,m)}, where there are min(n, m) − r zeros. This way, the vector λ of singular values of an n × m matrix is a vector in IR^{min(n,m)}.

In particular, if A is an n × n symmetric positive semidefinite (PSD) matrix, i.e., A^⊤ = A and u^⊤Au ≥ 0 for all u ∈ IR^n, then the singular values of A are equal to its eigenvalues.

The largest singular value of A, denoted by λ_max(A), also satisfies the following variational formulation:

    λ_max(A) = max_{x∈IR^n} |Ax|_2 / |x|_2 = max_{x∈IR^n, y∈IR^m} y^⊤Ax / ( |y|_2 |x|_2 ) = max_{x∈S^{n−1}, y∈S^{m−1}} y^⊤Ax .

In the case of an n × n PSD matrix A, we have

    λ_max(A) = max_{x∈S^{n−1}} x^⊤Ax .

In these notes, we always assume that eigenvectors and singular vectors have unit norm.
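These facts are easy to confirm numerically on a random matrix; the following sketch (sizes are arbitrary) checks that the right singular vectors are eigenvectors of A^⊤A and probes the variational formula for λ_max on random directions.

    import numpy as np

    # Numerical illustration of the SVD facts above on a random matrix.
    rng = np.random.default_rng(5)
    m, n = 7, 4
    A = rng.normal(size=(m, n))
    U, lam, Vt = np.linalg.svd(A, full_matrices=False)

    # Right singular vectors are eigenvectors of A^T A with eigenvalues lambda_j^2.
    for j in range(n):
        assert np.allclose(A.T @ A @ Vt[j], lam[j] ** 2 * Vt[j])

    # lambda_max(A) = max_x |Ax|_2 / |x|_2, probed on random directions.
    xs = rng.normal(size=(100_000, n))
    ratios = np.linalg.norm(xs @ A.T, axis=1) / np.linalg.norm(xs, axis=1)
    assert ratios.max() <= lam[0] + 1e-9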

Norms and inner product

Let A = {aij} and B = {bij} be two real matrices. Their size will be implicitin the following notation.

Vector norms

The simplest way to treat a matrix is to deal with it as if it were a vector. In particular, we can extend ℓ_q norms to matrices:

    |A|_q = ( Σ_{ij} |a_{ij}|^q )^{1/q} ,   q > 0 .


The cases q ∈ {0, ∞} can also be extended to matrices:

    |A|_0 = Σ_{ij} 1I(a_{ij} ≠ 0) ,    |A|_∞ = max_{ij} |a_{ij}| .

The case q = 2 plays a particular role for matrices, and |A|_2 is called the Frobenius norm of A; it is often denoted by ‖A‖_F. It is also the Hilbert-Schmidt norm associated to the inner product

    ⟨A, B⟩ = Tr(A^⊤B) = Tr(B^⊤A) .

Spectral norms

Let λ = (λ_1, . . . , λ_r, 0, . . . , 0) be the singular values of a matrix A. We can define spectral norms on A as vector norms of the vector λ. In particular, for any q ∈ [1, ∞],

    ‖A‖_q = |λ|_q

is called the Schatten q-norm of A. Here again, special cases have special names:

• q = 2: ‖A‖_2 = ‖A‖_F is the Frobenius norm defined above.

• q = 1: ‖A‖_1 = ‖A‖_* is called the nuclear norm (or trace norm) of A.

• q = ∞: ‖A‖_∞ = λ_max(A) = ‖A‖_op is called the operator norm (or spectral norm) of A.

We are going to employ these norms to assess the proximity to our matrix of interest. While the interpretation of vector norms is clear by extension from the vector case, the meaning of "‖A − B‖_op is small" is not as transparent. The following subsection provides some inequalities (without proofs) that allow a better reading.
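The Schatten norms are straightforward to compute from the singular values; the short sketch below (on an arbitrary random matrix) compares the three special cases with numpy's built-in matrix norms.

    import numpy as np

    # Schatten q-norms are vector l_q norms of the singular values; compare the
    # three special cases with numpy's built-in matrix norms.
    rng = np.random.default_rng(6)
    A = rng.normal(size=(5, 3))
    lam = np.linalg.svd(A, compute_uv=False)

    frobenius = np.sqrt(np.sum(lam ** 2))   # q = 2
    nuclear = np.sum(lam)                   # q = 1
    operator = np.max(lam)                  # q = infinity

    assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
    assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))
    assert np.isclose(operator, np.linalg.norm(A, 2))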

Useful matrix inequalities

Let A and B be two m × n matrices with singular values λ_1(A) ≥ λ_2(A) ≥ . . . ≥ λ_{min(m,n)}(A) and λ_1(B) ≥ . . . ≥ λ_{min(m,n)}(B) respectively. Then the following inequalities hold:

    max_k | λ_k(A) − λ_k(B) | ≤ ‖A − B‖_op ,                               (Weyl, 1912)
    Σ_k | λ_k(A) − λ_k(B) |² ≤ ‖A − B‖_F² ,                                (Hoffman-Wielandt, 1953)
    ⟨A, B⟩ ≤ ‖A‖_p ‖B‖_q ,   1/p + 1/q = 1 , p, q ∈ [1, ∞] .               (Hölder)

The singular value decomposition is associated to the following useful lemma, known as the Eckart-Young (or Eckart-Young-Mirsky) theorem. It states that, for any given k, the closest matrix of rank k to a given matrix A in Frobenius norm is given by its truncated SVD.


Lemma 5.1. Let A be a rank-r matrix with singular value decomposition

    A = Σ_{i=1}^r λ_i u_i v_i^⊤ ,

where λ_1 ≥ λ_2 ≥ · · · ≥ λ_r > 0 are the ordered singular values of A. For any k < r, let A_k be the truncated singular value decomposition of A given by

    A_k = Σ_{i=1}^k λ_i u_i v_i^⊤ .

Then for any matrix B such that rank(B) ≤ k, it holds

    ‖A − A_k‖_F ≤ ‖A − B‖_F .

Moreover,

    ‖A − A_k‖_F² = Σ_{j=k+1}^r λ_j² .

Proof. Note that the last equality of the lemma is obvious since

    A − A_k = Σ_{j=k+1}^r λ_j u_j v_j^⊤

and the matrices in the sum above are orthogonal to each other. Thus, it is sufficient to prove that for any matrix B such that rank(B) ≤ k, it holds

    ‖A − B‖_F² ≥ Σ_{j=k+1}^r λ_j² .

To that end, denote by σ_1 ≥ σ_2 ≥ · · · ≥ σ_k ≥ 0 the ordered singular values of B (some may be equal to zero if rank(B) < k) and observe that it follows from the Hoffman-Wielandt inequality that

    ‖A − B‖_F² ≥ Σ_{j=1}^r ( λ_j − σ_j )² = Σ_{j=1}^k ( λ_j − σ_j )² + Σ_{j=k+1}^r λ_j² ≥ Σ_{j=k+1}^r λ_j² .
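The lemma is also easy to observe experimentally; the following sketch (sizes are arbitrary) compares the truncated SVD with randomly generated rank-k matrices in Frobenius norm.

    import numpy as np

    # Illustration of the Eckart-Young lemma: the truncated SVD A_k is at least
    # as close to A in Frobenius norm as random rank-k matrices B.
    rng = np.random.default_rng(7)
    m, n, k = 8, 6, 2
    A = rng.normal(size=(m, n))
    U, lam, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * lam[:k]) @ Vt[:k]

    err_k = np.linalg.norm(A - A_k, 'fro')
    assert np.isclose(err_k ** 2, np.sum(lam[k:] ** 2))
    for _ in range(1000):
        B = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))   # random rank-k matrix
        assert err_k <= np.linalg.norm(A - B, 'fro') + 1e-12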

5.2 MULTIVARIATE REGRESSION

In the traditional regression setup, the response variable Y is a scalar. In several applications, the goal is not to predict a single variable but rather a vector Y ∈ IR^T, still from a covariate X ∈ IR^d. A standard example arises in genomics data where Y contains T physical measurements of a patient and X contains the expression levels for d genes. As a result, the regression function in this case, f(x) = IE[Y | X = x], is a function from IR^d to IR^T. Clearly, f can be estimated independently for each coordinate, using the tools that we have developed in the previous chapter. However, we will see that in several interesting scenarios, some structure is shared across coordinates and this information can be leveraged to yield better prediction bounds.


The model

Throughout this section, we consider the following multivariate linear regression model:

Y = XΘ∗ + E ,   (5.1)

where Y ∈ IR^{n×T} is the matrix of observed responses, X is the n × d observed design matrix (as before), Θ∗ ∈ IR^{d×T} is the matrix of unknown parameters and E ∼ subG_{n×T}(σ²) is the noise matrix. In this chapter, we will focus on the prediction task, which consists in estimating XΘ∗.

As mentioned in the foreword of this chapter, we can view this problem as T (univariate) linear regression problems Y^{(j)} = Xθ^{∗,(j)} + ε^{(j)}, j = 1, . . . , T, where Y^{(j)}, θ^{∗,(j)} and ε^{(j)} are the jth columns of Y, Θ∗ and E respectively. In particular, an estimator for XΘ∗ can be obtained by concatenating the estimators for each of the T problems. This approach is the subject of Problem 5.1.

The columns of Θ∗ correspond to T different regression tasks. Consider the following example as a motivation. Assume that the Subway headquarters want to evaluate the effect of d variables (promotions, day of the week, TV ads, . . . ) on their sales. To that end, they ask each of their T = 40,000 restaurants to report their sales numbers for the past n = 200 days. As a result, franchise j returns to headquarters a vector Y^{(j)} ∈ IR^n. The d variables for each of the n days are already known to headquarters and are stored in a matrix X ∈ IR^{n×d}. In this case, it may be reasonable to assume that the same subset of variables has an impact on the sales of each franchise, though the magnitude of this impact may differ from franchise to franchise. As a result, one may assume that each of the T columns of Θ∗ is sparse and that they all share the same sparsity pattern, i.e., Θ∗ is of the form:

Θ∗ =
⎡ 0 0 0 0 ⎤
⎢ • • • • ⎥
⎢ • • • • ⎥
⎢ 0 0 0 0 ⎥
⎢ ⋮ ⋮ ⋮ ⋮ ⎥
⎢ 0 0 0 0 ⎥
⎣ • • • • ⎦ ,

where • indicates a potentially nonzero entry.

It follows from the result of Problem 5.1 that if each task is performed individually, one may find an estimator Θ̂ such that

(1/n) IE‖XΘ̂ − XΘ∗‖²F ≲ σ² kT log(ed)/n ,

where k is the number of nonzero coordinates in each column of Θ∗. We recall that the term log(ed) corresponds to the additional price to pay for not knowing where the nonzero components are. However, when the number of tasks grows, the problem should become easier. This fact was proved in [LPTVDG11]. We will see that we can recover a similar phenomenon when the number of tasks becomes large, though the number of tasks required is larger than in [LPTVDG11].


Indeed, rather than exploiting sparsity, observe that such a matrix Θ∗ has rank at most k. This is the kind of structure that we will be predominantly using in this chapter.

Rather than assuming that the columns of Θ∗ share the same sparsity pattern, it may be more appropriate to assume that the matrix Θ∗ is low rank, or approximately so. As a result, while the matrix may not be sparse at all, the fact that it is low rank still materializes the idea that some structure is shared across different tasks. In this more general setup, it is assumed that the columns of Θ∗ live in a lower dimensional space. Going back to the Subway example, this amounts to assuming that while there are 40,000 franchises, there are only a few canonical profiles for these franchises and that all franchises are linear combinations of these profiles.

Sub-Gaussian matrix model

Recall that under the assumption ORT for the design matrix, i.e., X^⊤X = nI_d, the univariate regression model can be reduced to the sub-Gaussian sequence model. Here we investigate the effect of this assumption on the multivariate regression model (5.1).

Observe that under assumption ORT,

(1/n) X^⊤Y = Θ∗ + (1/n) X^⊤E .

This can be written as an equation in IR^{d×T}, called the sub-Gaussian matrix model (sGMM):

y = Θ∗ + F ,   (5.2)

where y = (1/n)X^⊤Y and F = (1/n)X^⊤E ∼ subG_{d×T}(σ²/n).

Indeed, for any u ∈ S^{d−1}, v ∈ S^{T−1}, it holds

u^⊤Fv = (1/n)(Xu)^⊤Ev = (1/√n) w^⊤Ev ∼ subG(σ²/n) ,

where w = Xu/√n has unit norm: |w|²_2 = u^⊤(X^⊤X/n)u = |u|²_2 = 1.

Akin to the sub-Gaussian sequence model, we have a direct observation model where we observe the parameter of interest with additive noise. This enables us to use thresholding methods for estimating Θ∗ when |Θ∗|_0 is small. However, this also follows from Problem 5.1. The reduction to the vector case in the sGMM is just as straightforward. The interesting analysis begins when Θ∗ is low-rank, which is equivalent to sparsity in its unknown eigenbasis.

Consider the SVD of Θ∗:

Θ∗ = ∑_j λ_j u_j v_j^⊤ ,

and recall that ‖Θ∗‖_0 = |λ|_0. Therefore, if we knew u_j and v_j, we could simply estimate the λ_j's by hard thresholding. It turns out that estimating these singular vectors by the singular vectors of y is sufficient.


Consider the SVD of the observed matrix y:

y = ∑_j λ̂_j û_j v̂_j^⊤ .

Definition 5.2. The singular value thresholding estimator with threshold 2τ ≥ 0 is defined by

Θ̂^svt = ∑_j λ̂_j 1I(|λ̂_j| > 2τ) û_j v̂_j^⊤ .

Recall that the threshold for the hard thresholding estimator was chosen to be the level of the noise with high probability. The singular value thresholding estimator obeys the same rule, except that the norm in which the magnitude of the noise is measured is adapted to the matrix case. Specifically, the following lemma will allow us to control the operator norm of the matrix F.

Lemma 5.3. Let A be a d × T random matrix such that A ∼ subG_{d×T}(σ²). Then

‖A‖op ≤ 4σ√(log(12)(d ∨ T)) + 2σ√(2 log(1/δ))

with probability 1 − δ.

Proof. This proof follows the same steps as Problem 1.4. Let N_1 be a 1/4-net for S^{d−1} and N_2 be a 1/4-net for S^{T−1}. It follows from Lemma 1.18 that we can always choose |N_1| ≤ 12^d and |N_2| ≤ 12^T. Moreover, for any u ∈ S^{d−1}, v ∈ S^{T−1}, it holds

u^⊤Av ≤ max_{x∈N_1} x^⊤Av + (1/4) max_{u∈S^{d−1}} u^⊤Av
 ≤ max_{x∈N_1} max_{y∈N_2} x^⊤Ay + (1/4) max_{x∈N_1} max_{v∈S^{T−1}} x^⊤Av + (1/4) max_{u∈S^{d−1}} max_{v∈S^{T−1}} u^⊤Av
 ≤ max_{x∈N_1} max_{y∈N_2} x^⊤Ay + (1/2) max_{u∈S^{d−1}} max_{v∈S^{T−1}} u^⊤Av .

It yields

‖A‖op ≤ 2 max_{x∈N_1} max_{y∈N_2} x^⊤Ay ,

so that for any t ≥ 0, by a union bound,

IP(‖A‖op > t) ≤ ∑_{x∈N_1, y∈N_2} IP(x^⊤Ay > t/2) .

Next, since A ∼ subG_{d×T}(σ²), it holds that x^⊤Ay ∼ subG(σ²) for any x ∈ N_1, y ∈ N_2. Together with the above display, it yields

IP(‖A‖op > t) ≤ 12^{d+T} exp(−t²/(8σ²)) ≤ δ

for

t ≥ 4σ√(log(12)(d ∨ T)) + 2σ√(2 log(1/δ)) .


The following theorem holds.

Theorem 5.4. Consider the multivariate linear regression model (5.1) under the assumption ORT or, equivalently, the sub-Gaussian matrix model (5.2). Then, the singular value thresholding estimator Θ̂^svt with threshold

2τ = 8σ√(log(12)(d ∨ T)/n) + 4σ√(2 log(1/δ)/n) ,   (5.3)

satisfies

(1/n)‖XΘ̂^svt − XΘ∗‖²F = ‖Θ̂^svt − Θ∗‖²F ≤ 144 rank(Θ∗)τ² ≲ (σ² rank(Θ∗)/n)(d ∨ T + log(1/δ))

with probability 1 − δ.

Proof. Assume without loss of generality that the singular values of Θ∗ and y are arranged in non-increasing order: λ_1 ≥ λ_2 ≥ . . . and λ̂_1 ≥ λ̂_2 ≥ . . . . Define the set S = {j : |λ̂_j| > 2τ}.

Observe first that it follows from Lemma 5.3 that ‖F‖op ≤ τ for τ chosen as in (5.3), on an event A such that IP(A) ≥ 1 − δ. The rest of the proof assumes that the event A occurred.

Note that it follows from Weyl's inequality that |λ̂_j − λ_j| ≤ ‖F‖op ≤ τ. It implies that S ⊂ {j : |λ_j| > τ} and S^c ⊂ {j : |λ_j| ≤ 3τ}.

Next define the oracle Θ̃ = ∑_{j∈S} λ_j u_j v_j^⊤ and note that

‖Θ̂^svt − Θ∗‖²F ≤ 2‖Θ̂^svt − Θ̃‖²F + 2‖Θ̃ − Θ∗‖²F .   (5.4)

Using the Hölder inequality, we control the first term as follows:

‖Θ̂^svt − Θ̃‖²F ≤ rank(Θ̂^svt − Θ̃) ‖Θ̂^svt − Θ̃‖²op ≤ 2|S| ‖Θ̂^svt − Θ̃‖²op .

Moreover,

‖Θ̂^svt − Θ̃‖op ≤ ‖Θ̂^svt − y‖op + ‖y − Θ∗‖op + ‖Θ∗ − Θ̃‖op ≤ max_{j∈S^c} |λ̂_j| + τ + max_{j∈S^c} |λ_j| ≤ 6τ .

Therefore,

‖Θ̂^svt − Θ̃‖²F ≤ 72|S|τ² = 72 ∑_{j∈S} τ² .

The second term in (5.4) can be written as

‖Θ̃ − Θ∗‖²F = ∑_{j∈S^c} |λ_j|² .


Plugging the above two displays into (5.4), we get

‖Θ̂^svt − Θ∗‖²F ≤ 144 ∑_{j∈S} τ² + 2 ∑_{j∈S^c} |λ_j|² .

Since on S, τ² = min(τ², |λ_j|²) and on S^c, |λ_j|² ≤ 9 min(τ², |λ_j|²), it yields

‖Θ̂^svt − Θ∗‖²F ≤ 144 ∑_j min(τ², |λ_j|²) ≤ 144 ∑_{j=1}^{rank(Θ∗)} τ² = 144 rank(Θ∗)τ² .
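As an illustration, here is a small simulation of the singular value thresholding estimator of Definition 5.2 with the threshold (5.3). This is only a sketch (Python/NumPy; the Gaussian noise, the dimensions and all names are our own choices), working directly in the sGMM (5.2).

```python
import numpy as np

def svt_estimator(y, tau):
    """Singular value thresholding: keep singular values larger than 2*tau."""
    U, lam, Vt = np.linalg.svd(y, full_matrices=False)
    keep = lam > 2 * tau
    return (U[:, keep] * lam[keep]) @ Vt[keep, :]

rng = np.random.default_rng(1)
n, d, T, r, sigma, delta = 500, 50, 40, 3, 1.0, 0.05

# Low-rank parameter Theta* and observation y = Theta* + F as in (5.2).
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))
F = sigma / np.sqrt(n) * rng.standard_normal((d, T))  # Gaussian noise is sub-Gaussian
y = Theta_star + F

# Threshold (5.3).
two_tau = (8 * sigma * np.sqrt(np.log(12) * max(d, T) / n)
           + 4 * sigma * np.sqrt(2 * np.log(1 / delta) / n))
Theta_svt = svt_estimator(y, two_tau / 2)

err = np.linalg.norm(Theta_svt - Theta_star, "fro") ** 2
print(err, 144 * r * (two_tau / 2) ** 2)  # error vs. the bound of Theorem 5.4
```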

In the next subsection, we extend our analysis to the case where X does not necessarily satisfy the assumption ORT.

Penalization by rank

The estimator in this section is the counterpart of the BIC estimator in the spectral domain. However, we will see that unlike BIC, it can be computed efficiently.

Let Θ̂^rk be any solution to the following minimization problem:

min_{Θ∈IR^{d×T}} { (1/n)‖Y − XΘ‖²F + 2τ² rank(Θ) } .

This estimator is called the estimator by rank penalization with regularization parameter τ². It enjoys the following property.

Theorem 5.5. Consider the multivariate linear regression model (5.1). Then, the estimator by rank penalization Θ̂^rk with regularization parameter τ², where τ is defined in (5.3), satisfies

(1/n)‖XΘ̂^rk − XΘ∗‖²F ≤ 8 rank(Θ∗)τ² ≲ (σ² rank(Θ∗)/n)(d ∨ T + log(1/δ))

with probability 1 − δ.

Proof. We begin as usual by noting that

‖Y − XΘ̂^rk‖²F + 2nτ² rank(Θ̂^rk) ≤ ‖Y − XΘ∗‖²F + 2nτ² rank(Θ∗) ,

which is equivalent to

‖XΘ̂^rk − XΘ∗‖²F ≤ 2⟨E, XΘ̂^rk − XΘ∗⟩ − 2nτ² rank(Θ̂^rk) + 2nτ² rank(Θ∗) .

Next, by Young's inequality, we have

2⟨E, XΘ̂^rk − XΘ∗⟩ ≤ 2⟨E, U⟩² + (1/2)‖XΘ̂^rk − XΘ∗‖²F ,


where

U = (XΘ̂^rk − XΘ∗)/‖XΘ̂^rk − XΘ∗‖F .

Write XΘ̂^rk − XΘ∗ = ΦN, where Φ is an n × r matrix, r ≤ d, whose columns form an orthonormal basis of the column span of X. The matrix Φ can come from the SVD of X, for example: X = ΦΛΨ^⊤. It yields

U = ΦN/‖N‖F

and

‖XΘ̂^rk − XΘ∗‖²F ≤ 4⟨Φ^⊤E, N/‖N‖F⟩² − 4nτ² rank(Θ̂^rk) + 4nτ² rank(Θ∗) .   (5.5)

Note that rank(N) ≤ rank(Θ̂^rk) + rank(Θ∗). Therefore, by Hölder's inequality, we get

⟨E, U⟩² = ⟨Φ^⊤E, N/‖N‖F⟩² ≤ ‖Φ^⊤E‖²op ‖N‖²_1/‖N‖²F ≤ rank(N)‖Φ^⊤E‖²op ≤ ‖Φ^⊤E‖²op [rank(Θ̂^rk) + rank(Θ∗)] .

Next, note that Lemma 5.3 yields ‖Φ^⊤E‖²op ≤ nτ², so that

⟨E, U⟩² ≤ nτ² [rank(Θ̂^rk) + rank(Θ∗)] .

Together with (5.5), this completes the proof.

It follows from Theorem 5.5 that the estimator by rank penalization enjoys the same properties as the singular value thresholding estimator, even when X does not satisfy the ORT condition. This is reminiscent of the BIC estimator, which enjoys the same properties as the hard thresholding estimator. However, this analogy does not extend to computational questions. Indeed, while the rank penalty, just like the sparsity penalty, is not convex, it turns out that XΘ̂^rk can be computed efficiently.

Note first that

min_{Θ∈IR^{d×T}} (1/n)‖Y − XΘ‖²F + 2τ² rank(Θ) = min_k { (1/n) min_{Θ∈IR^{d×T}, rank(Θ)≤k} ‖Y − XΘ‖²F + 2τ²k } .

Therefore, it remains to show that

min_{Θ∈IR^{d×T}, rank(Θ)≤k} ‖Y − XΘ‖²F


can be solved efficiently. To that end, let Ȳ = X(X^⊤X)†X^⊤Y denote the orthogonal projection of Y onto the image space of X, viewed as a linear operator from IR^{d×T} into IR^{n×T}. By the Pythagorean theorem, we get for any Θ ∈ IR^{d×T},

‖Y − XΘ‖²F = ‖Y − Ȳ‖²F + ‖Ȳ − XΘ‖²F .

Next consider the SVD of Ȳ:

Ȳ = ∑_j λ_j u_j v_j^⊤ ,

where λ_1 ≥ λ_2 ≥ . . . , and define Ỹ by

Ỹ = ∑_{j=1}^k λ_j u_j v_j^⊤ .

By Lemma 5.1, it holds

‖Ȳ − Ỹ‖²F = min_{Z : rank(Z)≤k} ‖Ȳ − Z‖²F .

Therefore, any minimizer of XΘ ↦ ‖Y − XΘ‖²F over matrices of rank at most k can be obtained by truncating the SVD of Ȳ at order k.

Once XΘ̂^rk has been found, one may obtain a corresponding Θ̂^rk by least squares, but this is not necessary for our results.
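The computation just described is easy to carry out numerically. The following is only a sketch (Python/NumPy; names and toy parameters are ours): it projects Y onto the column span of X, truncates the SVD at each candidate rank k, and picks the rank minimizing the penalized criterion.

```python
import numpy as np

def rank_penalized_fit(X, Y, tau):
    """Compute X @ Theta_rk for the rank-penalized estimator."""
    n = X.shape[0]
    # Orthogonal projection of Y onto the column span of X.
    Y_bar = X @ np.linalg.pinv(X.T @ X) @ X.T @ Y
    U, lam, Vt = np.linalg.svd(Y_bar, full_matrices=False)
    best_val, best_fit = np.inf, None
    for k in range(len(lam) + 1):
        fit_k = (U[:, :k] * lam[:k]) @ Vt[:k, :]      # truncated SVD of Y_bar
        val = np.linalg.norm(Y - fit_k, "fro") ** 2 / n + 2 * tau ** 2 * k
        if val < best_val:
            best_val, best_fit = val, fit_k
    return best_fit  # equals X @ Theta_rk

# Toy usage on synthetic data.
rng = np.random.default_rng(2)
n, d, T, r = 200, 10, 8, 2
X = rng.standard_normal((n, d))
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, T))
Y = X @ Theta_star + rng.standard_normal((n, T))
fit = rank_penalized_fit(X, Y, tau=0.5)
print(np.linalg.norm(fit - X @ Theta_star, "fro") ** 2 / n)
```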

Remark 5.6. While the rank penalized estimator can be computed efficiently, it is worth pointing out that a convex relaxation of the rank penalty can also be used. The estimator by nuclear norm penalization Θ̂ is defined to be any solution to the minimization problem

min_{Θ∈IR^{d×T}} { (1/n)‖Y − XΘ‖²F + τ‖Θ‖_1 } .

Clearly this criterion is convex and it can actually be implemented efficiently using semi-definite programming. It has been popularized by matrix completion problems. Let X have the following SVD:

X = ∑_{j=1}^r λ_j u_j v_j^⊤ ,

with λ_1 ≥ λ_2 ≥ . . . ≥ λ_r > 0. It can be shown that for some appropriate choice of τ, it holds

(1/n)‖XΘ̂ − XΘ∗‖²F ≲ (λ_1/λ_r) (σ² rank(Θ∗)/n)(d ∨ T)

with probability .99. However, the proof of this result is far more involved than a simple adaptation of the proof for the Lasso estimator to the matrix case (the readers are invited to see that for themselves). For one thing, there is no assumption on the design matrix (such as INC for example). This result can be found in [KLT11].
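For concreteness, one simple way to approximate the nuclear-norm penalized estimator numerically is proximal gradient descent, whose proximal step soft-thresholds the singular values. The sketch below (Python/NumPy; the step size, iteration count and names are our own choices, not part of the notes) only illustrates the mechanics and is not an optimized solver.

```python
import numpy as np

def soft_threshold_svd(M, thresh):
    """Prox of thresh * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - thresh, 0.0)
    return (U * s) @ Vt

def nuclear_norm_regression(X, Y, tau, n_iter=500):
    """Proximal gradient descent on (1/n)||Y - X Theta||_F^2 + tau * ||Theta||_*."""
    n, d = X.shape
    T = Y.shape[1]
    step = n / (2 * np.linalg.norm(X, 2) ** 2)   # 1/L for the smooth part
    Theta = np.zeros((d, T))
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ Theta - Y)
        Theta = soft_threshold_svd(Theta - step * grad, step * tau)
    return Theta
```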


5.3 COVARIANCE MATRIX ESTIMATION

Empirical covariance matrix

Let X_1, . . . , X_n be n i.i.d. copies of a random vector X ∈ IR^d such that IE[XX^⊤] = Σ for some unknown matrix Σ ⪰ 0 called the covariance matrix. This matrix contains information about the moments of order 2 of the random vector X. A natural candidate to estimate Σ is the empirical covariance matrix Σ̂ defined by

Σ̂ = (1/n) ∑_{i=1}^n X_iX_i^⊤ .

Using the tools of Chapter 1, we can prove the following result.

Theorem 5.7. Let Y ∈ IR^d be a random vector such that IE[Y] = 0, IE[YY^⊤] = I_d and Y ∼ subG_d(1). Let X_1, . . . , X_n be n independent copies of the sub-Gaussian random vector X = Σ^{1/2}Y. Then IE[X] = 0, IE[XX^⊤] = Σ and X ∼ subG_d(‖Σ‖op). Moreover,

‖Σ̂ − Σ‖op ≲ ‖Σ‖op (√((d + log(1/δ))/n) ∨ (d + log(1/δ))/n) ,

with probability 1 − δ.

Proof. It follows from elementary computations that IE[X] = 0, IE[XX^⊤] = Σ and X ∼ subG_d(‖Σ‖op). To prove the deviation inequality, observe first that, without loss of generality, we can assume that Σ = I_d. Indeed,

‖Σ̂ − Σ‖op/‖Σ‖op = ‖(1/n)∑_{i=1}^n X_iX_i^⊤ − Σ‖op/‖Σ‖op
 ≤ ‖Σ^{1/2}‖op ‖(1/n)∑_{i=1}^n Y_iY_i^⊤ − I_d‖op ‖Σ^{1/2}‖op / ‖Σ‖op
 = ‖(1/n)∑_{i=1}^n Y_iY_i^⊤ − I_d‖op .

Thus, in the rest of the proof, we assume that Σ = I_d. Let N be a 1/4-net for S^{d−1} such that |N| ≤ 12^d. It follows from the proof of Lemma 5.3 that

‖Σ̂ − I_d‖op ≤ 2 max_{x,y∈N} x^⊤(Σ̂ − I_d)y .

Hence, for any t ≥ 0, by a union bound,

IP(‖Σ̂ − I_d‖op > t) ≤ ∑_{x,y∈N} IP(x^⊤(Σ̂ − I_d)y > t/2) .   (5.6)

It holds

x^⊤(Σ̂ − I_d)y = (1/n) ∑_{i=1}^n {(X_i^⊤x)(X_i^⊤y) − IE[(X_i^⊤x)(X_i^⊤y)]} .


Using polarization, we also have

(X_i^⊤x)(X_i^⊤y) = (Z_+² − Z_−²)/4 ,

where Z_+ = X_i^⊤(x + y) and Z_− = X_i^⊤(x − y). It yields

IE[exp(s((X_i^⊤x)(X_i^⊤y) − IE[(X_i^⊤x)(X_i^⊤y)]))]
 = IE[exp((s/4)(Z_+² − IE[Z_+²]) − (s/4)(Z_−² − IE[Z_−²]))]
 ≤ (IE[exp((s/2)(Z_+² − IE[Z_+²]))] IE[exp(−(s/2)(Z_−² − IE[Z_−²]))])^{1/2} ,

where in the last inequality we used Cauchy-Schwarz. Next, since X ∼ subG_d(1), we have Z_+, Z_− ∼ subG(2), and it follows from Lemma 1.12 that

Z_+² − IE[Z_+²] ∼ subE(32)  and  Z_−² − IE[Z_−²] ∼ subE(32) .

Therefore, for any s ≤ 1/16, we have for any Z ∈ {Z_+, Z_−} that

IE[exp((s/2)(Z² − IE[Z²]))] ≤ e^{128s²} .

It yields that

(X_i^⊤x)(X_i^⊤y) − IE[(X_i^⊤x)(X_i^⊤y)] ∼ subE(16) .

Applying now Bernstein's inequality (Theorem 1.13), we get

IP(x^⊤(Σ̂ − I_d)y > t/2) ≤ exp[−(n/2)((t/32)² ∧ (t/32))] .   (5.7)

Together with (5.6), this yields

IP(‖Σ̂ − I_d‖op > t) ≤ 144^d exp[−(n/2)((t/32)² ∧ (t/32))] .   (5.8)

In particular, the right-hand side of the above inequality is at most δ ∈ (0, 1) if

t/32 ≥ ((2d/n) log(144) + (2/n) log(1/δ)) ∨ ((2d/n) log(144) + (2/n) log(1/δ))^{1/2} .

This concludes our proof.
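A quick simulation illustrating the operator-norm scaling of Theorem 5.7 in the regime d ≪ n follows; it is only a sketch (Python/NumPy, Gaussian data, our own parameter choices).

```python
import numpy as np

rng = np.random.default_rng(3)
d = 20

# A covariance matrix with a rank-one spike on top of the identity.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
Sigma = np.eye(d) + np.outer(v, v)
L = np.linalg.cholesky(Sigma)

for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, d)) @ L.T        # rows are N(0, Sigma)
    Sigma_hat = X.T @ X / n
    err = np.linalg.norm(Sigma_hat - Sigma, 2)   # operator norm
    print(n, round(err, 3), round(np.linalg.norm(Sigma, 2) * np.sqrt(d / n), 3))
```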

Theorem 5.7 indicates that for fixed d, the empirical covariance matrix is a consistent estimator of Σ (in any norm, as they are all equivalent in finite dimension). However, the bound that we got is not satisfactory in high dimensions, when d ≫ n. To overcome this limitation, we can introduce sparsity as we have done in the case of regression. The most obvious way to do so is to assume that few of the entries of Σ are nonzero, and it turns out that in this case thresholding is optimal.


There is a long line of work on this subject (see for example [CZZ10] and [CZ12]).

Once we have a good estimator of Σ, what can we do with it? The key insight is that Σ contains information about the projection of the vector X onto any direction u ∈ S^{d−1}. Indeed, we have that var(X^⊤u) = u^⊤Σu, which can be readily estimated by V̂ar(X^⊤u) = u^⊤Σ̂u. Observe that it follows from Theorem 5.7 that

|V̂ar(X^⊤u) − Var(X^⊤u)| = |u^⊤(Σ̂ − Σ)u| ≤ ‖Σ̂ − Σ‖op ≲ ‖Σ‖op (√((d + log(1/δ))/n) ∨ (d + log(1/δ))/n)

with probability 1 − δ.

The above fact is useful, for example, in the Markowitz theory of portfolio selection [Mar52], where a portfolio of assets is a vector u ∈ IR^d such that |u|_1 = 1 and the risk of a portfolio is given by the variance Var(X^⊤u). The goal is then to maximize reward subject to risk constraints. In most instances, the empirical covariance matrix is plugged into the formula in place of Σ (see Problem 5.4).

5.4 PRINCIPAL COMPONENT ANALYSIS

Spiked covariance model

Estimating the variance in all directions is also useful for dimension reduction. In Principal Component Analysis (PCA), the goal is to find one (or more) directions onto which the data X_1, . . . , X_n can be projected without losing much of its properties. There are several goals for doing this, but perhaps the most prominent ones are data visualization (in few dimensions, one can plot and visualize the cloud of n points) and clustering (clustering is a hard computational problem and it is therefore preferable to carry it out in lower dimensions). An example of the output of a principal component analysis is given in Figure 5.1. In this figure, the data has been projected onto two orthogonal directions, PC1 and PC2, that were estimated to have the most variance (among all such orthogonal pairs). The idea is that when projected onto such directions, points will remain far apart and a clustering pattern will still emerge. This is the case in Figure 5.1, where the original data is given by d = 500,000 gene expression levels measured on n ≈ 1,387 people. Depicted are the projections of these 1,387 points in two dimensions. This image has become quite popular as it shows that gene expression levels can recover the structure induced by geographic clustering. How is it possible to "compress" half a million dimensions into only two? The answer is that the data is intrinsically low dimensional. In this case, a plausible assumption is that all the 1,387 points live close to a two-dimensional linear subspace. To see how this assumption (in one dimension instead of two for simplicity) translates into the structure of the covariance matrix Σ, assume that X_1, . . . , X_n are Gaussian random vectors generated as follows.


Figure 5.1. Projection onto two dimensions of 1,387 points from gene expression data. Source: Gene expression blog.

Fix a direction v ∈ S^{d−1} and let Y_1, . . . , Y_n ∼ N_d(0, I_d), so that the v^⊤Y_i are i.i.d. N(0, 1). In particular, the vectors (v^⊤Y_1)v, . . . , (v^⊤Y_n)v live in the one-dimensional space spanned by v. If one would observe such data, the problem would be easy as only two observations would suffice to recover v. Instead, we observe X_1, . . . , X_n ∈ IR^d where X_i = (v^⊤Y_i)v + Z_i, and Z_i ∼ N_d(0, σ²I_d) are i.i.d. and independent of the Y_i's; that is, we add isotropic noise to every point. If σ is small enough, we can hope to recover the direction v (see Figure 5.2). The covariance matrix of X_i generated as such is given by

Σ = IE[XX^⊤] = IE[((v^⊤Y)v + Z)((v^⊤Y)v + Z)^⊤] = vv^⊤ + σ²I_d .

This model is often called the spiked covariance model. By a simple rescaling, it is equivalent to the following definition.

Definition 5.8. A covariance matrix Σ ∈ IR^{d×d} is said to satisfy the spiked covariance model if it is of the form

Σ = θvv^⊤ + I_d ,

where θ > 0 and v ∈ S^{d−1}. The vector v is called the spike.


Figure 5.2. Points are close to a one-dimensional space spanned by v.

This model can be extended to more than one spike, but this extension is beyond the scope of these notes.

Clearly, under the spiked covariance model, v is the eigenvector of the matrix Σ that is associated to its largest eigenvalue 1 + θ. We will refer to this vector simply as the largest eigenvector. To estimate it, a natural candidate is the largest eigenvector v̂ of Σ̃, where Σ̃ is any estimator of Σ. There is a caveat: by symmetry, if u is an eigenvector of a symmetric matrix, then −u is also an eigenvector associated to the same eigenvalue. Therefore, we may only estimate v up to a sign flip. To overcome this limitation, it is often useful to describe proximity between two vectors u and v in terms of the principal angle between their linear spans. Let us recall that for two unit vectors, the principal angle between their linear spans is denoted by ∠(u, v) and defined as

∠(u, v) = arccos(|u^⊤v|) .

The following result from perturbation theory is known as the Davis-Kahan sin(θ) theorem, as it bounds the sine of the principal angle between eigenspaces. This theorem exists in much more general versions that extend beyond one-dimensional eigenspaces.

Theorem 5.9 (Davis-Kahan sin(θ) theorem). Let A, B be two d × d PSD matrices. Let (λ_1, u_1), . . . , (λ_d, u_d) (resp. (μ_1, v_1), . . . , (μ_d, v_d)) denote the eigenvalue-eigenvector pairs of A (resp. B), ordered such that λ_1 ≥ λ_2 ≥ . . . (resp. μ_1 ≥ μ_2 ≥ . . . ). Then

sin(∠(u_1, v_1)) ≤ (2/max(λ_1 − λ_2, μ_1 − μ_2)) ‖A − B‖op .

Moreover,

min_{ε∈{±1}} |εu_1 − v_1|²_2 ≤ 2 sin²(∠(u_1, v_1)) .

Proof. Note that u_1^⊤Au_1 = λ_1 and, for any x ∈ S^{d−1}, it holds

x^⊤Ax = ∑_{j=1}^d λ_j (u_j^⊤x)² ≤ λ_1(u_1^⊤x)² + λ_2(1 − (u_1^⊤x)²) = λ_1 cos²(∠(u_1, x)) + λ_2 sin²(∠(u_1, x)) .


Therefore, taking x = v_1, we get on the one hand that

u_1^⊤Au_1 − v_1^⊤Av_1 ≥ λ_1 − λ_1 cos²(∠(u_1, v_1)) − λ_2 sin²(∠(u_1, v_1)) = (λ_1 − λ_2) sin²(∠(u_1, v_1)) .

On the other hand,

u_1^⊤Au_1 − v_1^⊤Av_1 = u_1^⊤Bu_1 − v_1^⊤Av_1 + u_1^⊤(A − B)u_1
 ≤ v_1^⊤Bv_1 − v_1^⊤Av_1 + u_1^⊤(A − B)u_1   (v_1 is the leading eigenvector of B)
 = ⟨A − B, u_1u_1^⊤ − v_1v_1^⊤⟩
 ≤ ‖A − B‖op ‖u_1u_1^⊤ − v_1v_1^⊤‖_1   (Hölder)
 ≤ ‖A − B‖op √2 ‖u_1u_1^⊤ − v_1v_1^⊤‖_2 ,   (Cauchy-Schwarz)

where in the last inequality we used the fact that rank(u_1u_1^⊤ − v_1v_1^⊤) ≤ 2.

It is straightforward to check that

‖u_1u_1^⊤ − v_1v_1^⊤‖²_2 = ‖u_1u_1^⊤ − v_1v_1^⊤‖²F = 2 − 2(u_1^⊤v_1)² = 2 sin²(∠(u_1, v_1)) .

We have proved that

(λ_1 − λ_2) sin²(∠(u_1, v_1)) ≤ 2‖A − B‖op sin(∠(u_1, v_1)) ,

which concludes the first part of the lemma. Note that we can replace λ_1 − λ_2 with μ_1 − μ_2 since the result is completely symmetric in A and B.

It remains to show the second part of the lemma. To that end, observe that

min_{ε∈{±1}} |εu_1 − v_1|²_2 = 2 − 2|u_1^⊤v_1| ≤ 2 − 2(u_1^⊤v_1)² = 2 sin²(∠(u_1, v_1)) .

Combined with Theorem 5.7, we immediately get the following corollary.

Corollary 5.10. Let Y ∈ IR^d be a random vector such that IE[Y] = 0, IE[YY^⊤] = I_d and Y ∼ subG_d(1). Let X_1, . . . , X_n be n independent copies of the sub-Gaussian random vector X = Σ^{1/2}Y, so that IE[X] = 0, IE[XX^⊤] = Σ and X ∼ subG_d(‖Σ‖op). Assume further that Σ = θvv^⊤ + I_d satisfies the spiked covariance model. Then, the largest eigenvector v̂ of the empirical covariance matrix Σ̂ satisfies

min_{ε∈{±1}} |εv̂ − v|_2 ≲ ((1 + θ)/θ) (√((d + log(1/δ))/n) ∨ (d + log(1/δ))/n)

with probability 1 − δ.

This result justifies the use of the empirical covariance matrix Σ̂ as a replacement for the true covariance matrix Σ when performing PCA in low dimensions, that is, when d ≪ n. In the high-dimensional case, where d ≫ n, the above result is uninformative. As before, we resort to sparsity to overcome this limitation.
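The corollary is easy to visualize numerically. Below is a short sketch (Python/NumPy; a Gaussian instance of the spiked model, with our own parameter choices) that draws data from the model and measures how well the leading empirical eigenvector recovers the spike.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, theta = 50, 2000, 1.0

# Spike v and spiked covariance Sigma = theta * v v^T + I_d.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
Sigma = theta * np.outer(v, v) + np.eye(d)

# Gaussian data with covariance Sigma and its empirical covariance matrix.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = X.T @ X / n

# Leading eigenvector of the empirical covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
v_hat = eigvecs[:, -1]

sin_angle = np.sqrt(max(0.0, 1.0 - (v_hat @ v) ** 2))
err = min(np.linalg.norm(v_hat - v), np.linalg.norm(v_hat + v))
print(sin_angle, err, (1 + theta) / theta * np.sqrt(d / n))
```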


Sparse PCA

In the example of Figure 5.1, it may be desirable to interpret the meaning of the two directions denoted by PC1 and PC2. We know that they are linear combinations of the original 500,000 gene expression levels. A natural question to ask is whether only a subset of these genes could suffice to obtain similar results. Such a discovery could have potentially interesting scientific applications, as it would point to a few genes responsible for disparities between European populations.

In the case of the spiked covariance model, this amounts to having a sparse v. Beyond interpretability, as we just discussed, sparsity should also lead to statistical stability, as in the case of sparse linear regression for example. To enforce sparsity, we will assume that v in the spiked covariance model is k-sparse: |v|_0 = k. Therefore, a natural candidate to estimate v is given by v̂ defined by

v̂^⊤Σ̂v̂ = max_{u∈S^{d−1}, |u|_0=k} u^⊤Σ̂u .

It is easy to check that λ^k_max(Σ̂) = v̂^⊤Σ̂v̂ is the largest of all leading eigenvalues among all k × k sub-matrices of Σ̂, so that the maximum is indeed attained, though there may be several maximizers. We call λ^k_max(Σ̂) the k-sparse leading eigenvalue of Σ̂ and v̂ a k-sparse leading eigenvector.
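This combinatorial definition can be evaluated exactly for small d by scanning all supports of size k, which is exponential in k; this is precisely why sparse PCA is computationally delicate in high dimensions. A brute-force sketch follows (Python/NumPy; naming is our own and this is only meant for tiny instances).

```python
import numpy as np
from itertools import combinations

def k_sparse_leading_eigenvector(Sigma_hat, k):
    """Exhaustive search over supports of size k (only feasible for small d)."""
    d = Sigma_hat.shape[0]
    best_val, best_vec = -np.inf, None
    for support in combinations(range(d), k):
        S = list(support)
        sub = Sigma_hat[np.ix_(S, S)]             # k x k sub-matrix
        vals, vecs = np.linalg.eigh(sub)
        if vals[-1] > best_val:                    # largest eigenvalue of the sub-matrix
            best_val = vals[-1]
            best_vec = np.zeros(d)
            best_vec[S] = vecs[:, -1]
    return best_val, best_vec
```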

Theorem 5.11. Let Y ∈ IR^d be a random vector such that IE[Y] = 0, IE[YY^⊤] = I_d and Y ∼ subG_d(1). Let X_1, . . . , X_n be n independent copies of the sub-Gaussian random vector X = Σ^{1/2}Y, so that IE[X] = 0, IE[XX^⊤] = Σ and X ∼ subG_d(‖Σ‖op). Assume further that Σ = θvv^⊤ + I_d satisfies the spiked covariance model for v such that |v|_0 = k ≤ d/2. Then, the k-sparse largest eigenvector v̂ of the empirical covariance matrix satisfies

min_{ε∈{±1}} |εv̂ − v|_2 ≲ ((1 + θ)/θ) (√((k log(ed/k) + log(1/δ))/n) ∨ (k log(ed/k) + log(1/δ))/n)

with probability 1 − δ.

Proof. We begin by obtaining an intermediate result from the proof of the Davis-Kahan sin(θ) theorem. Using the same steps as in the proof of Theorem 5.9, we get

v^⊤Σv − v̂^⊤Σv̂ ≤ ⟨Σ − Σ̂, vv^⊤ − v̂v̂^⊤⟩ .

Since both v and v̂ are k-sparse, there exists a (random) set S ⊂ {1, . . . , d} such that |S| ≤ 2k and {vv^⊤ − v̂v̂^⊤}_{ij} = 0 if (i, j) ∉ S². It yields

⟨Σ − Σ̂, vv^⊤ − v̂v̂^⊤⟩ = ⟨Σ(S) − Σ̂(S), v(S)v(S)^⊤ − v̂(S)v̂(S)^⊤⟩ ,

where for any d × d matrix M, we define M(S) to be the |S| × |S| sub-matrix of M with rows and columns indexed by S, and for any vector x ∈ IR^d, x(S) ∈ IR^{|S|} denotes the sub-vector of x with coordinates indexed by S.


It then yields, by Hölder's inequality, that

v^⊤Σv − v̂^⊤Σv̂ ≤ ‖Σ(S) − Σ̂(S)‖op ‖v(S)v(S)^⊤ − v̂(S)v̂(S)^⊤‖_1 .

Following the same steps as in the proof of Theorem 5.9, we now get

sin(∠(v̂, v)) ≤ (2/θ) sup_{S : |S|=2k} ‖Σ̂(S) − Σ(S)‖op .

To conclude the proof, it remains to control sup_{S : |S|=2k} ‖Σ̂(S) − Σ(S)‖op. To that end, observe that

IP[sup_{S : |S|=2k} ‖Σ̂(S) − Σ(S)‖op > t‖Σ‖op] ≤ ∑_{S : |S|=2k} IP[‖Σ̂(S) − Σ(S)‖op > t‖Σ(S)‖op]
 ≤ (d choose 2k) 144^{2k} exp[−(n/2)((t/32)² ∧ (t/32))] ,

where we used (5.8) in the second inequality. Using Lemma 2.7, we get that the right-hand side above is further bounded by

exp[−(n/2)((t/32)² ∧ (t/32)) + 2k log(144) + 2k log(ed/(2k))] .

Choosing now t such that

t ≥ C (√((k log(ed/k) + log(1/δ))/n) ∨ (k log(ed/k) + log(1/δ))/n) ,

for C large enough ensures that the desired bound holds with probability at least 1 − δ.

Minimax lower bound

The last section has established that the spike v in the spiked covariance model Σ = θvv^⊤ + I_d may be estimated at the rate θ^{−1}√(k log(d/k)/n), assuming that θ ≤ 1 and k log(d/k) ≤ n, which corresponds to the interesting regime. Using the tools from Chapter 4, we can show that this rate is in fact optimal already in the Gaussian case.

Theorem 5.12. Let X_1, . . . , X_n be n independent copies of the Gaussian random vector X ∼ N_d(0, Σ), where Σ = θvv^⊤ + I_d satisfies the spiked covariance model for v such that |v|_0 = k, and write IP_v for the distribution of X under this assumption and IE_v for the associated expectation. Then, there exist constants C, c > 0 such that for any n ≥ 2, d ≥ 9, 2 ≤ k ≤ (d + 7)/8, it holds

inf_{v̂} sup_{v∈S^{d−1}, |v|_0=k} IP^n_v [ min_{ε∈{±1}} |εv̂ − v|_2 ≥ C (1/θ)√((k/n) log(ed/k)) ] ≥ c .


Proof. We use the general technique developed in Chapter 4. Using this technique, we need to provide a set v_1, . . . , v_M ∈ S^{d−1} of k-sparse vectors such that

(i) min_{ε∈{−1,1}} |εv_j − v_l|_2 > 2ψ for all 1 ≤ j < l ≤ M ,

(ii) KL(IP_{v_j}, IP_{v_l}) ≤ c log(M)/n ,

where ψ = Cθ^{−1}√(k log(d/k)/n). Akin to the Gaussian sequence model, our goal is to employ the sparse Varshamov-Gilbert lemma to construct the v_j's. However, the v_j's have the extra constraint that they need to be of unit norm, so we cannot simply rescale the sparse binary vectors output by the sparse Varshamov-Gilbert lemma. Instead, we use the last coordinate to make sure that they have unit norm.

Recall that it follows from the sparse Varshamov-Gilbert Lemma 4.14 that there exist ω_1, . . . , ω_M ∈ {0, 1}^{d−1} such that |ω_j|_0 = k − 1, ρ(ω_i, ω_j) ≥ (k − 1)/2 for all i ≠ j, and log(M) ≥ ((k − 1)/8) log(1 + (d − 1)/(2k − 2)). Define v_j = (γω_j, ℓ), where γ < 1/√(k − 1) is to be chosen later and ℓ = √(1 − γ²(k − 1)). It is straightforward to check that |v_j|_0 = k, |v_j|_2 = 1 and v_j has only nonnegative coordinates. Moreover, for any i ≠ j,

min_{ε∈{±1}} |εv_i − v_j|²_2 = |v_i − v_j|²_2 = γ²ρ(ω_i, ω_j) ≥ γ²(k − 1)/2 .   (5.9)

We choose

γ² = (C_γ/(θ²n)) log(d/k) ,

so that γ²(k − 1) is of order ψ², and (i) is verified. It remains to check (ii). To that end, note that

KL(IP_u, IP_v) = (1/2) IE_u (X^⊤[(I_d + θvv^⊤)^{−1} − (I_d + θuu^⊤)^{−1}]X)
 = (1/2) IE_u Tr(X^⊤[(I_d + θvv^⊤)^{−1} − (I_d + θuu^⊤)^{−1}]X)
 = (1/2) Tr([(I_d + θvv^⊤)^{−1} − (I_d + θuu^⊤)^{−1}](I_d + θuu^⊤)) .

Next, we use the Sherman-Morrison formula:

(I_d + θuu^⊤)^{−1} = I_d − (θ/(1 + θ)) uu^⊤ , ∀u ∈ S^{d−1} .

(To see this, check that (I_d + θuu^⊤)(I_d − (θ/(1 + θ))uu^⊤) = I_d and that the two matrices on the left-hand side commute.)


It yields

KL(IP_u, IP_v) = (θ/(2(1 + θ))) Tr((uu^⊤ − vv^⊤)(I_d + θuu^⊤))
 = (θ²/(2(1 + θ))) (1 − (u^⊤v)²)
 = (θ²/(2(1 + θ))) sin²(∠(u, v)) .

In particular,

KL(IP_{v_i}, IP_{v_j}) = (θ²/(2(1 + θ))) sin²(∠(v_i, v_j)) .

Next, note that

sin²(∠(v_i, v_j)) = 1 − (γ²ω_i^⊤ω_j + ℓ²)² ≤ 2(1 − γ²ω_i^⊤ω_j − ℓ²) ≤ 2γ²(k − 1) ≤ 2C_γ (k/(θ²n)) log(d/k) ≤ log M/(θ²n) ,

for C_γ small enough. We conclude that (ii) holds, which completes our proof.

Together with the upper bound of Theorem 5.11, we get the following corollary.

Corollary 5.13. Assume that θ ≤ 1 and k log(d/k) ≤ n. Then θ^{−1}√(k log(d/k)/n) is the minimax rate of estimation over B_0(k) in the spiked covariance model.

5.5 GRAPHICAL MODELS

Gaussian graphical models

In Section 5.3, we noted that thresholding is an appropriate way to take advantage of sparsity in the covariance matrix when trying to estimate it. However, in some cases, it is more natural to assume sparsity in the inverse of the covariance matrix, Θ = Σ^{−1}, which is sometimes called the concentration or precision matrix. In this chapter, we will develop an approach that allows us to get guarantees similar to the Lasso for the error between the estimate and the ground truth matrix in Frobenius norm.

We can understand why sparsity in Θ can be appropriate in the context of undirected graphical models [Lau96]. These are models that serve to simplify dependence relations between high-dimensional distributions according to a graph.

Definition 5.14. Let G = (V, E) be a graph on d nodes and, to each node v, associate a random variable X_v. An undirected graphical model or Markov random field is a collection of probability distributions over (X_v)_{v∈V} that factorize according to

p(x_1, . . . , x_d) = (1/Z) ∏_{C∈𝒞} ψ_C(x_C) ,   (5.10)


where 𝒞 is the collection of all cliques (completely connected subsets) of G, x_C is the restriction of (x_1, . . . , x_d) to the subset C, and the ψ_C are non-negative potential functions.

This definition implies certain conditional independence relations, such as the following.

Proposition 5.15. A graphical model as in (5.10) fulfills the global Markov property. That is, for any triple (A, B, S) of disjoint subsets of V such that S separates A and B,

A ⊥⊥ B | S .

Proof. Let (A, B, S) be a triple as in the proposition statement. Define Ā to be the union of the connected components of G[V \ S] that intersect A, and set B̄ = V \ (Ā ∪ S). By assumption, A and B have to be in different connected components of G[V \ S], hence B ⊆ B̄, and any clique in G has to be a subset of Ā ∪ S or of B̄ ∪ S. Denoting the former cliques by 𝒞_Ā, we get

f(x) = ∏_{C∈𝒞} ψ_C(x) = ∏_{C∈𝒞_Ā} ψ_C(x) ∏_{C∈𝒞\𝒞_Ā} ψ_C(x) = h(x_{Ā∪S}) k(x_{B̄∪S}) ,

which implies the desired conditional independence.

A weaker form of the Markov property is the so-called pairwise Markov property, which says that the above holds only for singleton sets of the form A = {a}, B = {b} and S = V \ {a, b} with (a, b) ∉ E.

In the following, we will focus on Gaussian graphical models for n i.i.d. samples X_i ∼ N(0, (Θ∗)^{−1}). By contrast, there is a large body of work on discrete graphical models such as the Ising model (see for example [Bre15]), which we are not going to discuss here, although links to the following can be established as well [LWo12].

For a Gaussian model with mean zero, by considering the density in terms of the concentration matrix Θ,

f_Θ(x) ∝ exp(−x^⊤Θx/2) ,

it is immediate that the pairs (a, b) for which the pairwise Markov property holds are exactly those for which Θ_{a,b} = 0. Conversely, it can be shown that for a family of distributions with non-negative density, this actually implies the factorization (5.10) according to the graph given by the edges {(i, j) : Θ_{i,j} ≠ 0}. This is the content of the Hammersley-Clifford theorem.

Theorem 5.16 (Hammersley-Clifford). Let P be a probability distribution with positive density f with respect to a product measure over V. Then P factorizes over G as in (5.10) if and only if it fulfills the pairwise Markov property.

We show the "if" direction. The requirement that f be positive allows us to take logarithms. Pick a fixed element x∗. The idea of the proof is to rewrite the density in a way that allows us to make use of the pairwise Markov property. The following combinatorial lemma will be used for this purpose.


Lemma 5.17 (Möbius inversion). Let Ψ, Φ : P(V) → IR be functions defined for all subsets of a set V. Then, the following are equivalent:

1. For all A ⊆ V : Ψ(A) = ∑_{B⊆A} Φ(B) ;

2. For all A ⊆ V : Φ(A) = ∑_{B⊆A} (−1)^{|A\B|} Ψ(B) .

Proof. 2. =⇒ 1.:

∑_{B⊆A} Φ(B) = ∑_{B⊆A} ∑_{C⊆B} (−1)^{|B\C|} Ψ(C)
 = ∑_{C⊆A} Ψ(C) ∑_{C⊆B⊆A} (−1)^{|B\C|}
 = ∑_{C⊆A} Ψ(C) ∑_{H⊆A\C} (−1)^{|H|} .

Now, ∑_{H⊆A\C} (−1)^{|H|} = 0 unless A \ C = ∅, because any non-empty set has the same number of subsets with odd and even cardinality, which can be seen by induction. The converse implication is proved in the same way, exchanging the roles of Ψ and Φ.
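As a quick sanity check of the inversion formula (a numerical aside that is not part of the original notes), one can verify the equivalence on random functions over the subsets of a small set:

```python
from itertools import chain, combinations
import random

def subsets(A):
    """All subsets of the tuple A."""
    return chain.from_iterable(combinations(A, r) for r in range(len(A) + 1))

V = (0, 1, 2, 3)
random.seed(0)
Psi = {frozenset(B): random.random() for B in subsets(V)}

# Define Phi by Moebius inversion (statement 2 of Lemma 5.17).
Phi = {
    frozenset(A): sum((-1) ** (len(A) - len(B)) * Psi[frozenset(B)]
                      for B in subsets(A))
    for A in subsets(V)
}

# Check statement 1: Psi(A) equals the sum of Phi over subsets of A.
ok = all(
    abs(Psi[frozenset(A)] - sum(Phi[frozenset(B)] for B in subsets(A))) < 1e-10
    for A in subsets(V)
)
print(ok)  # True
```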

Proof. Define

H_A(x) = log f(x_A, x∗_{A^c})

and

Φ_A(x) = ∑_{B⊆A} (−1)^{|A\B|} H_B(x) .

Note that both H_A and Φ_A depend on x only through x_A. By Lemma 5.17, we get

log f(x) = H_V(x) = ∑_{A⊆V} Φ_A(x) .

It remains to show that Φ_A = 0 if A is not a clique. Assume A ⊆ V with v, w ∈ A, (v, w) ∉ E, and set C = A \ {v, w}. By adding all possible subsets of {v, w} to every subset of C, we get every subset of A, and hence

Φ_A(x) = ∑_{B⊆C} (−1)^{|C\B|} (H_B − H_{B∪{v}} − H_{B∪{w}} + H_{B∪{v,w}}) .


Writing D = V \ {v, w}, by the pairwise Markov property, we have

H_{B∪{v,w}}(x) − H_{B∪{v}}(x) = log [ f(x_v, x_w, x_B, x∗_{D\B}) / f(x_v, x∗_w, x_B, x∗_{D\B}) ]
 = log [ f(x_v | x_B, x∗_{D\B}) f(x_w, x_B, x∗_{D\B}) / ( f(x_v | x_B, x∗_{D\B}) f(x∗_w, x_B, x∗_{D\B}) ) ]
 = log [ f(x∗_v | x_B, x∗_{D\B}) f(x_w, x_B, x∗_{D\B}) / ( f(x∗_v | x_B, x∗_{D\B}) f(x∗_w, x_B, x∗_{D\B}) ) ]
 = log [ f(x∗_v, x_w, x_B, x∗_{D\B}) / f(x∗_v, x∗_w, x_B, x∗_{D\B}) ]
 = H_{B∪{w}}(x) − H_B(x) .

Thus Φ_A vanishes if A is not a clique.

In the following, we will show how to exploit this kind of sparsity when trying to estimate Θ. The key insight is that we can set up an optimization problem that has a similar error sensitivity as the Lasso, but with respect to Θ. In fact, it is the penalized version of the maximum likelihood estimator for the Gaussian distribution. We set

Θ̂_λ = argmin_{Θ≻0} Tr(Σ̂Θ) − log det(Θ) + λ‖Θ_{D^c}‖_1 ,   (5.11)

where we wrote Θ_{D^c} for the off-diagonal restriction of Θ, ‖A‖_1 = ∑_{i,j} |A_{i,j}| for the element-wise ℓ_1 norm, and Σ̂ = (1/n)∑_i X_iX_i^⊤ for the empirical covariance matrix of n i.i.d. observations. We will also make use of the notation

‖A‖∞ = max_{i,j∈[d]} |A_{i,j}|

for the element-wise ∞-norm of a matrix A ∈ IR^{d×d} and

‖A‖_{∞,∞} = max_i ∑_j |A_{i,j}|

for the ∞ operator norm.

We note that the function Θ ↦ − log det(Θ) has first derivative −Θ^{−1} and second derivative Θ^{−1} ⊗ Θ^{−1}, which means it is convex. Derivations for this can be found in [BV04, Appendix A.4]. This means that in the population case and for λ = 0, the minimizer in (5.11) would coincide with Θ∗.
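In practice, (5.11) is known as the graphical lasso. The sketch below uses scikit-learn's GraphicalLasso as an off-the-shelf solver; note that its penalty convention and its handling of the diagonal may differ slightly from (5.11), so this is only an illustration on synthetic data of our own choosing.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(5)
d, n = 10, 2000

# A sparse precision matrix: banded and diagonally dominant (hence positive definite).
Theta_star = np.eye(d)
for i in range(d - 1):
    Theta_star[i, i + 1] = Theta_star[i + 1, i] = 0.4
Sigma_star = np.linalg.inv(Theta_star)

X = rng.multivariate_normal(np.zeros(d), Sigma_star, size=n)

model = GraphicalLasso(alpha=0.05).fit(X)   # alpha plays the role of lambda
Theta_hat = model.precision_

print(np.linalg.norm(Theta_hat - Theta_star, "fro"))
print(np.round(Theta_hat, 2))
```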

First, let us derive a bound on the error in Σ̂ that is similar to Theorem 1.14, with the difference that we are dealing with sub-exponential distributions.

Lemma 5.18. If X_i ∼ N(0, Σ) are i.i.d. and Σ̂ = (1/n)∑_{i=1}^n X_iX_i^⊤ denotes the empirical covariance matrix, then

‖Σ̂ − Σ‖∞ ≲ ‖Σ‖op √(log(d/δ)/n) ,

with probability at least 1 − δ, if n ≳ log(d/δ).


Proof. Start as in the proof of Theorem 5.7 by writing Y = Σ^{−1/2}X ∼ N(0, I_d) and notice that

|e_i^⊤(Σ̂ − Σ)e_j| = |(Σ^{1/2}e_i)^⊤ ((1/n)∑_{l=1}^n Y_lY_l^⊤ − I_d) (Σ^{1/2}e_j)|
 ≤ ‖Σ^{1/2}‖²op |v^⊤((1/n)∑_{l=1}^n Y_lY_l^⊤ − I_d)w|

with v, w ∈ S^{d−1}. Hence, without loss of generality, we can assume Σ = I_d. The remaining term is sub-exponential and can be controlled as in (5.7):

IP(e_i^⊤(Σ̂ − I_d)e_j > t) ≤ exp[−(n/2)((t/16)² ∧ (t/16))] .

By a union bound, we get

IP(‖Σ̂ − I_d‖∞ > t) ≤ d² exp[−(n/2)((t/16)² ∧ (t/16))] .

Hence, if n ≳ log(d/δ), then the right-hand side above is at most δ ∈ (0, 1) if

t/16 ≥ (2 log(d/δ)/n)^{1/2} .

Theorem 5.19. Let X_i ∼ N(0, Σ) be i.i.d. and let Σ̂ = (1/n)∑_{i=1}^n X_iX_i^⊤ denote the empirical covariance matrix. Assume that |S| = k, where S = {(i, j) : i ≠ j, Θ∗_{i,j} ≠ 0}. There exists a constant c = c(‖Σ‖op) such that if

λ = c √(log(d/δ)/n)

and n ≳ log(d/δ), then the minimizer Θ̂ = Θ̂_λ in (5.11) fulfills

‖Θ̂ − Θ∗‖²F ≲ (d + k) log(d/δ)/n

with probability at least 1 − δ.

Proof. We start by considering the likelihood

ℓ(Θ, Σ) = Tr(ΣΘ) − log det(Θ) .

A Taylor expansion together with the mean value theorem yields, for any Θ ≻ 0,

ℓ(Θ, Σ̂) − ℓ(Θ∗, Σ̂) = Tr(Σ̂(Θ − Θ∗)) − Tr(Σ(Θ − Θ∗)) + (1/2) Tr(Θ̄^{−1}(Θ − Θ∗)Θ̄^{−1}(Θ − Θ∗)) ,


for some Θ̄ = Θ∗ + t(Θ − Θ∗), t ∈ [0, 1]. Note that, essentially by the convexity of − log det, we have

Tr(Θ̄^{−1}(Θ − Θ∗)Θ̄^{−1}(Θ − Θ∗)) = ‖Θ̄^{−1}(Θ − Θ∗)‖²F ≥ λmin(Θ̄^{−1})² ‖Θ − Θ∗‖²F

and

λmin(Θ̄^{−1}) = (λmax(Θ̄))^{−1} .

If we write ∆ = Θ − Θ∗, then for ‖∆‖F ≤ 1,

λmax(Θ̄) = ‖Θ̄‖op = ‖Θ∗ + t∆‖op ≤ ‖Θ∗‖op + ‖∆‖op ≤ ‖Θ∗‖op + ‖∆‖F ≤ ‖Θ∗‖op + 1 ,

and therefore

ℓ(Θ, Σ̂) − ℓ(Θ∗, Σ̂) ≥ Tr((Σ̂ − Σ)(Θ − Θ∗)) + c‖Θ − Θ∗‖²F ,

for c = (‖Θ∗‖op + 1)^{−2}/2, if ‖∆‖F ≤ 1.

This takes care of the case where ∆ is small. To handle the case where it is large, define g(t) = ℓ(Θ∗ + t∆, Σ̂) − ℓ(Θ∗, Σ̂). By the convexity of ℓ in Θ,

(g(1) − g(0))/1 ≥ (g(t) − g(0))/t ,

so that plugging in t = ‖∆‖F^{−1} gives

ℓ(Θ, Σ̂) − ℓ(Θ∗, Σ̂) ≥ ‖∆‖F (ℓ(Θ∗ + ∆/‖∆‖F, Σ̂) − ℓ(Θ∗, Σ̂))
 ≥ ‖∆‖F (Tr((Σ̂ − Σ)∆/‖∆‖F) + c)
 = Tr((Σ̂ − Σ)∆) + c‖∆‖F ,

for ‖∆‖F ≥ 1.

If we now write ∆ = Θ̂ − Θ∗ and assume ‖∆‖F ≥ 1, then by optimality of Θ̂,

‖∆‖F ≤ C [Tr((Σ − Σ̂)∆) + λ(‖Θ∗_{D^c}‖_1 − ‖Θ̂_{D^c}‖_1)]
 ≤ C [Tr((Σ − Σ̂)∆) + λ(‖∆_S‖_1 − ‖∆_{S^c}‖_1)] ,

where S = {(i, j) ∈ D^c : Θ∗_{i,j} ≠ 0}, by the triangle inequality. Now, split the error contributions of the diagonal and off-diagonal elements:

Tr((Σ − Σ̂)∆) + λ(‖∆_S‖_1 − ‖∆_{S^c}‖_1)
 ≤ ‖(Σ − Σ̂)_D‖F ‖∆_D‖F + ‖(Σ − Σ̂)_{D^c}‖∞ ‖∆_{D^c}‖_1 + λ(‖∆_S‖_1 − ‖∆_{S^c}‖_1) .

By the Hölder inequality, ‖(Σ − Σ̂)_D‖F ≤ √d ‖Σ − Σ̂‖∞, and by Lemma 5.18, ‖Σ − Σ̂‖∞ ≲ √(log(d/δ)/n) for n ≳ log(d/δ). Combining these two estimates,

Tr((Σ − Σ̂)∆) + λ(‖∆_S‖_1 − ‖∆_{S^c}‖_1)
 ≲ √(d log(d/δ)/n) ‖∆_D‖F + √(log(d/δ)/n) ‖∆_{D^c}‖_1 + λ(‖∆_S‖_1 − ‖∆_{S^c}‖_1) .


Setting λ = C√(log(d/δ)/n) and splitting ‖∆_{D^c}‖_1 = ‖∆_S‖_1 + ‖∆_{S^c}‖_1 yields

√(d log(d/δ)/n) ‖∆_D‖F + √(log(d/δ)/n) ‖∆_{D^c}‖_1 + λ(‖∆_S‖_1 − ‖∆_{S^c}‖_1)
 ≲ √(d log(d/δ)/n) ‖∆_D‖F + √(log(d/δ)/n) ‖∆_S‖_1
 ≤ √(d log(d/δ)/n) ‖∆_D‖F + √(k log(d/δ)/n) ‖∆_S‖F
 ≤ √((d + k) log(d/δ)/n) ‖∆‖F .

Combining this with the left-hand side c‖∆‖F shows that, for n ≳ (d + k) log(d/δ), the case ‖∆‖F ≥ 1 is impossible, a contradiction where this bound is effective. Combining it with the left-hand side c‖∆‖²F (the case ‖∆‖F ≤ 1) gives us

‖∆‖²F ≲ (d + k) log(d/δ)/n ,

as desired.

Write Σ∗ = W∗Γ∗W∗, where (W∗)² = Σ∗_D, and similarly define Ŵ by Ŵ² = Σ̂_D. Define a slightly modified estimator by replacing Σ̂ in (5.11) by Γ̂ = Ŵ^{−1}Σ̂Ŵ^{−1}, to get an estimator K̂ of K∗ = (Γ∗)^{−1}. Then

‖Γ̂ − Γ∗‖∞ = ‖Ŵ^{−1}Σ̂Ŵ^{−1} − (W∗)^{−1}Σ∗(W∗)^{−1}‖∞   (5.12)
 ≤ ‖Ŵ^{−1} − (W∗)^{−1}‖_{∞,∞} ‖Σ̂ − Σ∗‖∞ ‖Ŵ^{−1} − (W∗)^{−1}‖_{∞,∞}
 + ‖Ŵ^{−1} − (W∗)^{−1}‖_{∞,∞} (‖Σ̂‖∞‖(W∗)^{−1}‖_{∞,∞} + ‖Σ∗‖∞‖Ŵ^{−1}‖_{∞,∞})
 + ‖Ŵ^{−1}‖_{∞,∞} ‖Σ̂ − Σ∗‖∞ ‖(W∗)^{−1}‖_{∞,∞} .   (5.13)

Note that ‖A‖∞ ≤ ‖A‖op, so that if n ≳ log(d/δ), then ‖Γ̂ − Γ∗‖∞ ≲ √(log(d/δ)/n).

From the above arguments, we conclude that ‖K̂ − K∗‖F ≲ √(k log(d/δ)/n) and, with a calculation similar to (5.13), that the associated estimator Θ̂′ = Ŵ^{−1}K̂Ŵ^{−1} of Θ∗ satisfies

‖Θ̂′ − Θ∗‖op ≲ √(k log(d/δ)/n) .

This will not work in Frobenius norm since there we only have ‖Ŵ² − (W∗)²‖²F ≲ d log(d/δ)/n.

Lower bounds

Here, we will show that the bounds in Theorem 5.19 are optimal up to log factors. The argument will again be based on applying Fano's inequality, Theorem 4.10.


In order to calculate the KL divergence between two Gaussian distributions with densities f_{Σ_1} and f_{Σ_2}, we first observe that

log(f_{Σ_1}(x)/f_{Σ_2}(x)) = −(1/2) x^⊤Θ_1x + (1/2) log det(Θ_1) + (1/2) x^⊤Θ_2x − (1/2) log det(Θ_2) ,

so that

KL(IP_{Σ_1}, IP_{Σ_2}) = IE_{Σ_1} log(f_{Σ_1}/f_{Σ_2})(X)
 = (1/2)[log det(Θ_1) − log det(Θ_2) + IE[Tr((Θ_2 − Θ_1)XX^⊤)]]
 = (1/2)[log det(Θ_1) − log det(Θ_2) + Tr((Θ_2 − Θ_1)Σ_1)]
 = (1/2)[log det(Θ_1) − log det(Θ_2) + Tr(Θ_2Σ_1) − d] .

Writing ∆ = Θ_2 − Θ_1 and Θ̄ = Θ_1 + t∆ for some t ∈ [0, 1], a Taylor expansion then yields

KL(IP_{Σ_1}, IP_{Σ_2}) = (1/2)[−Tr(∆Σ_1) + (1/2) Tr(Θ̄^{−1}∆Θ̄^{−1}∆) + Tr(∆Σ_1)]
 ≤ (1/4) λmax(Θ̄^{−1})² ‖∆‖²F
 = (1/4) λmin(Θ̄)^{−2} ‖∆‖²F .
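As a numerical aside (not in the original notes), one can sanity-check the closed-form KL and the quadratic bound above; since λmin is concave, its value on the segment between the two precision matrices is at least the minimum at the endpoints, which gives a checkable version of the bound.

```python
import numpy as np

def kl_gaussian(Theta1, Theta2):
    """KL(N(0, Sigma1) || N(0, Sigma2)) written in terms of the precision matrices."""
    d = Theta1.shape[0]
    Sigma1 = np.linalg.inv(Theta1)
    _, logdet1 = np.linalg.slogdet(Theta1)
    _, logdet2 = np.linalg.slogdet(Theta2)
    return 0.5 * (logdet1 - logdet2 + np.trace(Theta2 @ Sigma1) - d)

rng = np.random.default_rng(6)
d = 5
A = rng.standard_normal((d, d))
Theta1 = A @ A.T + d * np.eye(d)             # a positive definite precision matrix
Delta = 0.1 * rng.standard_normal((d, d))
Delta = (Delta + Delta.T) / 2                 # symmetric perturbation
Theta2 = Theta1 + Delta                       # nearby precision matrix

kl = kl_gaussian(Theta1, Theta2)
# lambda_min on the segment is at least the minimum at the endpoints (concavity).
lam_min = min(np.linalg.eigvalsh(Theta1)[0], np.linalg.eigvalsh(Theta2)[0])
bound = 0.25 * np.linalg.norm(Delta, "fro") ** 2 / lam_min ** 2
print(kl, bound, kl <= bound)
```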

This bound means we are in good shape for applying our standard tricks if we can make sure that λmin(Θ̄) stays large enough among our hypotheses. To this end, assume k ≤ d²/16 and define the collection of matrices (B_j)_{j∈[M_1]} with entries in {0, 1} such that they are symmetric, have zero diagonal, only have non-zero entries in a band of size √k around the diagonal, and whose entries are given by a flattening of the vectors obtained from the sparse Varshamov-Gilbert Lemma 4.14. Moreover, let (C_j)_{j∈[M_2]} be diagonal matrices with entries in {0, 1} given by the regular Varshamov-Gilbert Lemma 4.12. Then, define

Θ_j = Θ_{j_1,j_2} = I + (β/√n)(√(log(1 + d/(2k))) B_{j_1} + C_{j_2}) ,  j_1 ∈ [M_1], j_2 ∈ [M_2] .

We can ensure that λmin(Θ_j) stays bounded away from zero by the Gershgorin circle theorem. For each row of Θ_j − I,

∑_l |(Θ_j − I)_{il}| ≤ (2β(√k + 1)/√n) √(log(1 + d/(2k))) < 1/2

if n ≳ k log(1 + d/(2k)). Hence,

‖Θ_j − Θ_l‖²F = (β²/n)(ρ(B_{j_1}, B_{l_1}) log(1 + d/(2k)) + ρ(C_{j_2}, C_{l_2}))
 ≳ (β²/n)(k log(1 + d/(2k)) + d) ,


and

‖Θ_j − Θ_l‖²F ≲ (β²/n)(k log(1 + d/(2k)) + d) ≤ (β²/n) log(M_1M_2) ,

which yields the following theorem.

Theorem 5.20. Denote by B_k the set of positive definite matrices with at most k non-zero off-diagonal entries and assume n ≳ k log(d). Then, if IP_Θ denotes a Gaussian distribution with inverse covariance matrix Θ and mean 0, there exists a constant c > 0 such that

inf_{Θ̂} sup_{Θ∗∈B_k} IP^{⊗n}_{Θ∗}(‖Θ̂ − Θ∗‖²F ≥ c (α²/n)(k log(1 + d/(2k)) + d)) ≥ 1/2 − 2α .

Ising model

Like the Gaussian graphical model, the Ising model is also a model of pairwise interactions, but for random variables that take values in {−1, 1}^d.

Before developing our general approach, we describe the main ideas in the simplest Markov random field: an Ising model without external field [VMLC16]. (The presence of an external field does not change our method; it merely introduces an intercept in the logistic regression problem and comes at the cost of more cumbersome notation. All arguments below follow, potentially after minor modifications, and explicit computations are left to the reader.) Such models specify the distribution of a random vector Z = (Z^{(1)}, . . . , Z^{(d)}) ∈ {−1, 1}^d as follows:

IP(Z = z) = exp(z^⊤Wz − Φ(W)) ,   (5.14)

where W ∈ IR^{d×d} is an unknown matrix of interest with null diagonal elements and

Φ(W) = log ∑_{z∈{−1,1}^d} exp(z^⊤Wz)

is a normalization term known as the log-partition function.

Fix β, λ > 0. Our goal in this section is to estimate the parameter W subject to the constraint that the jth row e_j^⊤W of W satisfies |e_j^⊤W|_1 ≤ λ. To that end, we observe n independent copies Z_1, . . . , Z_n of Z ∼ IP and estimate the matrix W row by row using constrained likelihood maximization. Recall from [RWL10, Eq. (4)] that for any j ∈ [d] and any z^{(j)} ∈ {−1, 1}, it holds that

IP(Z^{(j)} = z^{(j)} | Z^{(¬j)} = z^{(¬j)}) = exp(2z^{(j)} e_j^⊤Wz) / (1 + exp(2z^{(j)} e_j^⊤Wz)) ,   (5.15)

where we used the fact that the diagonal elements of W are equal to zero.


Therefore, the jth row e_j^⊤W of W may be estimated by performing a logistic regression of Z^{(j)} onto Z^{(¬j)}, subject to the constraint that |e_j^⊤W|_1 ≤ λ and ‖e_j^⊤W‖∞ ≤ β. To simplify notation, fix j and define Y = Z^{(j)}, X = Z^{(¬j)} and w∗ = (e_j^⊤W)^{(¬j)}. Let (X_1, Y_1), . . . , (X_n, Y_n) be n independent copies of (X, Y). Then the (rescaled) log-likelihood of a candidate parameter w ∈ IR^{d−1} is denoted by ℓ̄_n(w) and defined by

ℓ̄_n(w) = (1/n) ∑_{i=1}^n log( exp(2Y_iX_i^⊤w) / (1 + exp(2Y_iX_i^⊤w)) )
 = (2/n) ∑_{i=1}^n Y_iX_i^⊤w − (1/n) ∑_{i=1}^n log(1 + exp(2Y_iX_i^⊤w)) .

With this representation, it is easy to see that w ↦ ℓ̄_n(w) is a concave function. The (constrained) maximum likelihood estimator (MLE) ŵ ∈ IR^{d−1} is defined to be any solution of the following convex optimization problem:

ŵ ∈ argmax_{w∈B_1(λ)} ℓ̄_n(w) .   (5.16)

This problem can be solved very efficiently using a variety of methods similar to the Lasso.
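As a practical aside (not from the notes), a common surrogate replaces the ℓ_1-ball constraint in (5.16) by an ℓ_1 penalty, which is how such problems are usually solved in software. A sketch using scikit-learn's penalized logistic regression, with our own naming and a heuristic choice of the penalty level C:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_ising_row(Z, j, C=0.5):
    """l1-penalized logistic regression of Z[:, j] on the other coordinates.

    Returns an estimate of the j-th row of W with its (zero) diagonal entry removed.
    The l1 penalty is a Lagrangian surrogate for the constraint |w|_1 <= lambda.
    """
    y = (Z[:, j] + 1) // 2                      # recode {-1, +1} -> {0, 1}
    X = np.delete(Z, j, axis=1)
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                             fit_intercept=False)   # no external field, no intercept
    clf.fit(2 * X, y)                            # factor 2 matches (5.15)
    return clf.coef_.ravel()

# Usage: Z is an n x d array with entries in {-1, +1};
# w_hat_j = estimate_ising_row(Z, j=0)
```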

The following lemma follows from [Rig12].

Lemma 5.21. Fix δ ∈ (0, 1). Conditionally on (X_1, . . . , X_n), the constrained MLE ŵ defined in (5.16) satisfies, with probability at least 1 − δ,

(1/n) ∑_{i=1}^n (X_i^⊤(ŵ − w∗))² ≤ 2λe^λ √(log(2(d − 1)/δ)/(2n)) .

Proof. We begin with some standard manipulations of the log-likelihood. Note that

log( exp(2Y_iX_i^⊤w) / (1 + exp(2Y_iX_i^⊤w)) ) = log( exp(Y_iX_i^⊤w) / (exp(−Y_iX_i^⊤w) + exp(Y_iX_i^⊤w)) )
 = log( exp(Y_iX_i^⊤w) / (exp(−X_i^⊤w) + exp(X_i^⊤w)) )
 = (Y_i + 1)X_i^⊤w − log(1 + exp(2X_i^⊤w)) .

Therefore, writing Ȳ_i = (Y_i + 1)/2 ∈ {0, 1}, we get that ŵ is the solution to the following minimization problem:

ŵ = argmin_{w∈B_1(λ)} κ_n(w) ,

where

κ_n(w) = (1/n) ∑_{i=1}^n { −2Ȳ_iX_i^⊤w + log(1 + exp(2X_i^⊤w)) } .


For any w ∈ IR^{d−1}, write κ(w) = IE[κ_n(w)], where here and throughout this proof all expectations are implicitly taken conditionally on X_1, . . . , X_n. Observe that

κ(w) = (1/n) ∑_{i=1}^n { −2IE[Ȳ_i]X_i^⊤w + log(1 + exp(2X_i^⊤w)) } .

Next, we get from the basic inequality κ_n(ŵ) ≤ κ_n(w∗) that

κ(ŵ) − κ(w∗) ≤ (1/n) ∑_{i=1}^n (Ȳ_i − IE[Ȳ_i]) X_i^⊤(ŵ − w∗)
 ≤ (2λ/n) max_{1≤j≤d−1} |∑_{i=1}^n (Ȳ_i − IE[Ȳ_i]) X_i^⊤e_j| ,

where in the second inequality we used Hölder's inequality together with the facts that |X_i|∞ ≤ 1 for all i ∈ [n] and that |ŵ − w∗|_1 ≤ 2λ. Together with Hoeffding's inequality and a union bound, it yields

κ(ŵ) − κ(w∗) ≤ 2λ √(log(4(d − 1)/δ)/(2n)) , with probability 1 − δ/2 .

Note that the Hessian of κ is given by

∇²κ(w) = (1/n) ∑_{i=1}^n (4 exp(2X_i^⊤w)/(1 + exp(2X_i^⊤w))²) X_iX_i^⊤ .

Moreover, for any w ∈ B_1(λ), since ‖X_i‖∞ ≤ 1, it holds

4 exp(2X_i^⊤w)/(1 + exp(2X_i^⊤w))² ≥ 4e^{2λ}/(1 + e^{2λ})² ≥ e^{−2λ} .

Next, since w∗ is a minimizer of w ↦ κ(w), we get from a second order Taylor expansion that

κ(ŵ) − κ(w∗) ≥ e^{−2λ} (1/n) ∑_{i=1}^n (X_i^⊤(ŵ − w∗))² .

This completes our proof.

Next, we bound from below the quantity

(1/n) ∑_{i=1}^n (X_i^⊤(ŵ − w∗))² .

To that end, we must exploit the covariance structure of the Ising model. The following lemma is similar to the combination of Lemmas 6 and 7 in [VMLC16].


Lemma 5.22. Fix δ ∈ (0, 1) and let Z_1, . . . , Z_n be n independent copies of Z ∈ IR^d distributed according to (5.14). The following holds with probability 1 − δ/2, uniformly in u ∈ B_1(2λ):

(1/n) ∑_{i=1}^n (Z_i^⊤u)² ≥ (1/2)|u|²∞ exp(−2λ) − 4λ √(log(2d(d − 1)/δ)/(2n)) .

Proof. Note that

(1/n) ∑_{i=1}^n (Z_i^⊤u)² = u^⊤S_nu ,

where S_n ∈ IR^{d×d} denotes the sample covariance matrix of Z_1, . . . , Z_n, defined by

S_n = (1/n) ∑_{i=1}^n Z_iZ_i^⊤ .

Observe that

u^⊤S_nu ≥ u^⊤Su − 2λ|S_n − S|∞ ,   (5.17)

where S = IE[S_n]. Using Hoeffding's inequality together with a union bound, we get that

|S_n − S|∞ ≤ 2√(log(2d(d − 1)/δ)/(2n))

with probability 1 − δ/2.

Moreover, u^⊤Su = var(Z^⊤u). Assume without loss of generality that |u^{(1)}| = |u|∞ and observe that

var(Z^⊤u) = IE[(Z^⊤u)²] = (u^{(1)})² + IE[(∑_{j=2}^d u^{(j)}Z^{(j)})²] + 2 IE[u^{(1)}Z^{(1)} ∑_{j=2}^d u^{(j)}Z^{(j)}] .

To control the cross term, let us condition on the neighborhood Z^{(¬1)} to get

2|IE[u^{(1)}Z^{(1)} ∑_{j=2}^d u^{(j)}Z^{(j)}]| = 2|IE(IE[u^{(1)}Z^{(1)} | Z^{(¬1)}] ∑_{j=2}^d u^{(j)}Z^{(j)})|
 ≤ IE((IE[u^{(1)}Z^{(1)} | Z^{(¬1)}])²) + IE((∑_{j=2}^d u^{(j)}Z^{(j)})²) ,

where we used the fact that 2|ab| ≤ a² + b² for all a, b ∈ IR. The above two displays together yield

var(Z^⊤u) ≥ |u|²∞ (1 − sup_{z∈{−1,1}^{d−1}} (IE[Z^{(1)} | Z^{(¬1)} = z])²) .

To control the above supremum, recall that from (5.15), we have

sup_{z∈{−1,1}^{d−1}} IE[Z^{(1)} | Z^{(¬1)} = z] = sup_{z∈{−1,1}^d} (exp(2e_1^⊤Wz) − 1)/(exp(2e_1^⊤Wz) + 1) = (exp(2λ) − 1)/(exp(2λ) + 1) .


Therefore,

var(Z^⊤u) ≥ |u|²∞/(1 + exp(2λ)) ≥ (1/2)|u|²∞ exp(−2λ) .

Together with (5.17), it yields the desired result.

Combining the above two lemmas immediately yields the following theorem.

Theorem 5.23. Let ŵ be the constrained maximum likelihood estimator (5.16) and let w∗ = (e_j^⊤W)^{(¬j)} be the jth row of W with the jth entry removed. Then, with probability 1 − δ, we have

|ŵ − w∗|²∞ ≤ 9λ exp(3λ) √(log(2d²/δ)/n) .


5.6 PROBLEM SET

Problem 5.1. Using the results of Chapter 2, show that the following holds for the multivariate regression model (5.1).

1. There exists an estimator Θ̂ ∈ IR^{d×T} such that

(1/n)‖XΘ̂ − XΘ∗‖²F ≲ σ² rT/n

with probability .99, where r denotes the rank of X.

2. There exists an estimator Θ̂ ∈ IR^{d×T} such that

(1/n)‖XΘ̂ − XΘ∗‖²F ≲ σ² |Θ∗|_0 log(ed)/n

with probability .99.

Problem 5.2. Consider the multivariate regression model (5.1) where Y has SVD

Y = ∑_j λ̂_j û_j v̂_j^⊤ .

Let M be defined by

M = ∑_j λ̂_j 1I(|λ̂_j| > 2τ) û_j v̂_j^⊤ , τ > 0 .

1. Show that there exists a choice of τ such that

(1/n)‖M − XΘ∗‖²F ≲ σ² rank(Θ∗)(d ∨ T)/n

with probability .99.

2. Show that there exists an n × n matrix P such that PM = XΘ̂ for some estimator Θ̂ and

(1/n)‖XΘ̂ − XΘ∗‖²F ≲ σ² rank(Θ∗)(d ∨ T)/n

with probability .99.

3. Comment on the above results in light of the results obtained in Section 5.2.

Problem 5.3. Consider the multivariate regression model (5.1) and let Θ̂ be any solution to the minimization problem

min_{Θ∈IR^{d×T}} { (1/n)‖Y − XΘ‖²F + τ‖XΘ‖_1 } .


1. Show that there exists a choice of τ such that

(1/n)‖XΘ̂ − XΘ∗‖²F ≲ σ² rank(Θ∗)(d ∨ T)/n

with probability .99.

[Hint: consider the matrix

∑_j ((λ̂_j + λ∗_j)/2) û_j v̂_j^⊤ ,

where λ∗_1 ≥ λ∗_2 ≥ . . . and λ̂_1 ≥ λ̂_2 ≥ . . . are the singular values of XΘ∗ and Y respectively, and the SVD of Y is given by Y = ∑_j λ̂_j û_j v̂_j^⊤.]

2. Find a closed form for XΘ̂.

Problem 5.4. In the Markowitz theory of portfolio selection [Mar52], a portfolio may be identified with a vector u ∈ IR^d such that u_j ≥ 0 and ∑_{j=1}^d u_j = 1. In this case, u_j represents the proportion of the portfolio invested in asset j. The vector of (random) returns of the d assets is denoted by X ∈ IR^d, and we assume that X ∼ subG_d(1) and IE[XX^⊤] = Σ is unknown.

In this theory, the two key characteristics of a portfolio u are its reward μ(u) = IE[u^⊤X] and its risk R(u) = var(X^⊤u). According to this theory, one should fix a minimum reward λ > 0 and choose the optimal portfolio

u∗ = argmin_{u : μ(u)≥λ} R(u)

when a solution exists for a given λ. It is the portfolio that has minimum risk among all portfolios with reward at least λ, provided such portfolios exist.

In practice, the distribution of X is unknown. Assume that we observe n independent copies X_1, . . . , X_n of X and use them to compute the following estimators of μ(u) and R(u) respectively:

μ̂(u) = X̄^⊤u = (1/n) ∑_{i=1}^n X_i^⊤u ,

R̂(u) = u^⊤Σ̂u ,  Σ̂ = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)(X_i − X̄)^⊤ .

We use the following estimated portfolio:

û = argmin_{u : μ̂(u)≥λ} R̂(u) .

We assume throughout that log d ≪ n ≪ d.
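For intuition only (this is not required by the problem), the plug-in portfolio û can be computed as a small quadratic program. The sketch below uses scipy.optimize.minimize with the SLSQP solver on synthetic data of our own choosing.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
d, n, lam = 8, 500, 0.02

mu_true = rng.normal(0.03, 0.02, size=d)              # mean returns
A = rng.standard_normal((d, d))
Sigma_true = 0.1 * A @ A.T / d + 0.05 * np.eye(d)      # covariance of returns
X = rng.multivariate_normal(mu_true, Sigma_true, size=n)

mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False)                    # 1/(n-1) normalization

def risk(u):
    return u @ Sigma_hat @ u

constraints = [
    {"type": "eq", "fun": lambda u: np.sum(u) - 1.0},      # weights sum to one
    {"type": "ineq", "fun": lambda u: mu_hat @ u - lam},   # estimated reward >= lambda
]
res = minimize(risk, np.ones(d) / d, bounds=[(0, 1)] * d,
               constraints=constraints, method="SLSQP")
u_hat = res.x
print(u_hat.round(3), risk(u_hat))
```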


1. Show that for any portfolio u, it holds

|μ̂(u) − μ(u)| ≲ 1/√n ,

and

|R̂(u) − R(u)| ≲ 1/√n .

2. Show that

R(û) − R(u∗) ≲ √(log d / n)

with probability .99.

3. Define the estimator ũ by

ũ = argmin_{u : μ̂(u)≥λ−ε} R̂(u) .

Find the smallest ε > 0 (up to a multiplicative constant) such that we have R(ũ) ≤ R(u∗) with probability .99.


Bibliography

[AS08] Noga Alon and Joel H. Spencer. The probabilistic method. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons, Inc., Hoboken, NJ, third edition, 2008. With an appendix on the life and work of Paul Erdős.

[AW02] Rudolf Ahlswede and Andreas Winter. Strong converse for identification via quantum channels. IEEE Transactions on Information Theory, 48(3):569–579, 2002.

[Ber09] Dennis S. Bernstein. Matrix mathematics. Princeton University Press, Princeton, NJ, second edition, 2009. Theory, facts, and formulas.

[Bil95] Patrick Billingsley. Probability and measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons Inc., New York, third edition, 1995. A Wiley-Interscience Publication.

[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities. Oxford University Press, Oxford, 2013. A nonasymptotic theory of independence, with a foreword by Michel Ledoux.

[BLT16] Pierre C. Bellec, Guillaume Lecué, and Alexandre B. Tsybakov. Slope meets Lasso: improved oracle bounds and optimality. arXiv preprint arXiv:1605.08651, 2016.

[Bre15] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. STOC'15—Proceedings of the 2015 ACM Symposium on Theory of Computing, pages 771–782, 2015.

[BRT09] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732, 2009.

[BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.


[BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004.

[BvS+15] Małgorzata Bogdan, Ewout van den Berg, Chiara Sabatti, Weijie Su, and Emmanuel J. Candès. SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103, 2015.

[Cav11] Laurent Cavalier. Inverse problems in statistics. In Inverse problems and high-dimensional estimation, volume 203 of Lect. Notes Stat. Proc., pages 3–96. Springer, Heidelberg, 2011.

[CL11] T. Tony Cai and Mark G. Low. Testing composite hypotheses, Hermite polynomials and optimal estimation of a nonsmooth functional. Ann. Statist., 39(2):1012–1041, 2011.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.

[CT07] Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351, 2007.

[CZ12] T. Tony Cai and Harrison H. Zhou. Minimax estimation of large covariance matrices under ℓ1-norm. Statist. Sinica, 22(4):1319–1349, 2012.

[CZZ10] T. Tony Cai, Cun-Hui Zhang, and Harrison H. Zhou. Optimal rates of convergence for covariance matrix estimation. Ann. Statist., 38(4):2118–2144, 2010.

[DDGS97] M. J. Donahue, C. Darken, L. Gurvits, and E. Sontag. Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13(2):187–220, 1997.

[EHJT04] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004. With discussion, and a rejoinder by the authors.

[FHT10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2010.

[FR13] Simon Foucart and Holger Rauhut. A mathematical introduction to compressive sensing. Applied and Numerical Harmonic Analysis. Birkhäuser/Springer, New York, 2013.

[Gro11] David Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011.


[Gru03] Branko Grünbaum. Convex polytopes, volume 221 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 2003. Prepared and with a preface by Volker Kaibel, Victor Klee and Günter M. Ziegler.

[GVL96] Gene H. Golub and Charles F. Van Loan. Matrix computations. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition, 1996.

[HTF01] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. Springer Series in Statistics. Springer-Verlag, New York, 2001. Data mining, inference, and prediction.

[Joh11] Iain M. Johnstone. Gaussian estimation: sequence and wavelet models. Unpublished manuscript, December 2011.

[KLT11] Vladimir Koltchinskii, Karim Lounici, and Alexandre B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist., 39(5):2302–2329, 2011.

[Lau96] Steffen L. Lauritzen. Graphical models, volume 17 of Oxford Statistical Science Series. The Clarendon Press, Oxford University Press, New York, 1996. Oxford Science Publications.

[Lie73] Elliott H. Lieb. Convex trace functions and the Wigner-Yanase-Dyson conjecture. Advances in Mathematics, 11(3):267–288, 1973.

[LPTVDG11] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Oracle inequalities and optimal inference under group sparsity. Ann. Statist., 39(4):2164–2204, 2011.

[LWo12] Po-Ling Loh, Martin J. Wainwright, et al. Structure estimation for discrete graphical models: generalized covariance matrices and their inverses. In NIPS, pages 2096–2104, 2012.

[Mal09] Stéphane Mallat. A wavelet tour of signal processing. Elsevier/Academic Press, Amsterdam, third edition, 2009. The sparse way, with contributions from Gabriel Peyré.

[Mar52] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.

[Nem00] Arkadi Nemirovski. Topics in non-parametric statistics. In Lectures on probability theory and statistics (Saint-Flour, 1998), volume 1738 of Lecture Notes in Math., pages 85–277. Springer, Berlin, 2000.


[Pis81] G. Pisier. Remarques sur un résultat non publié de B. Maurey. In Seminar on Functional Analysis, 1980–1981, pages Exp. No. V, 13. École Polytech., Palaiseau, 1981.

[Rig06] Philippe Rigollet. Adaptive density estimation using the blockwise Stein method. Bernoulli, 12(2):351–370, 2006.

[Rig12] Philippe Rigollet. Kullback-Leibler aggregation and misspecified generalized linear models. Ann. Statist., 40(2):639–665, 2012.

[Riv90] Theodore J. Rivlin. Chebyshev Polynomials: From Approximation Theory to Algebra and Number Theory. Wiley, July 1990.

[RT11] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. Ann. Statist., 39(2):731–771, 2011.

[Rus02] Mary Beth Ruskai. Inequalities for quantum entropy: a review with conditions for equality. Journal of Mathematical Physics, 43(9):4358–4375, 2002.

[RWL10] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Statist., 38(3):1287–1319, 2010.

[Sha03] Jun Shao. Mathematical statistics. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 2003.

[Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[Tsy03] Alexandre B. Tsybakov. Optimal rates of aggregation. In Bernhard Schölkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, pages 303–313. Springer, 2003.

[Tsy09] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, translated by Vladimir Zaiats.

[Ver18] Roman Vershynin. High-Dimensional Probability. Cambridge University Press (to appear), 2018.

[vH17] Ramon van Handel. Probability in high dimension. Lecture notes (Princeton University), 2017.


[VMLC16] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: efficient and sample-optimal learning of Ising models. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2595–2603. Curran Associates, Inc., 2016.