apts high-dimensional statistics - warwick

34
APTS High-dimensional Statistics Rajen D. Shah [email protected] July 11, 2019 Contents 1 Introduction 2 1.1 High-dimensional data ............................. 2 1.2 Classical statistics ................................ 2 1.3 Maximum likelihood estimation ........................ 3 1.4 Shortcomings of classical statistics ....................... 4 1.5 Notation ..................................... 4 2 Ridge regression and kernel machines 5 2.1 The SVD and PCA ............................... 6 2.2 Cross-validation ................................. 7 2.3 The kernel trick ................................. 9 2.4 The representer theorem ............................ 11 3 The Lasso 12 3.1 The Lasso estimator .............................. 12 3.2 Prediction error ................................. 13 3.2.1 The event .............................. 14 3.3 Estimation error ................................ 15 3.4 Subgradients and the KKT conditions .................... 18 3.5 Computation .................................. 19 3.6 Extensions of the Lasso ............................. 21 3.6.1 Generalised linear models ....................... 21 3.6.2 Removing the bias of the Lasso .................... 21 3.6.3 The elastic net ............................. 21 3.6.4 Group Lasso ............................... 22 3.6.5 Fused Lasso ............................... 22 1

Upload: others

Post on 18-Dec-2021

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: APTS High-dimensional Statistics - Warwick

APTS High-dimensional Statistics

Rajen D. [email protected]

July 11, 2019

Contents

1 Introduction 2

1.1 High-dimensional data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Classical statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Shortcomings of classical statistics . . . . . . . . . . . . . . . . . . . . . . . 41.5 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Ridge regression and kernel machines 5

2.1 The SVD and PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.3 The kernel trick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.4 The representer theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 The Lasso 12

3.1 The Lasso estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.2 Prediction error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.2.1 The event ⌦ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.3 Estimation error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.4 Subgradients and the KKT conditions . . . . . . . . . . . . . . . . . . . . 183.5 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193.6 Extensions of the Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.6.1 Generalised linear models . . . . . . . . . . . . . . . . . . . . . . . 213.6.2 Removing the bias of the Lasso . . . . . . . . . . . . . . . . . . . . 213.6.3 The elastic net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213.6.4 Group Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223.6.5 Fused Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1

Page 2: APTS High-dimensional Statistics - Warwick

4 Graphical modelling 22

4.1 Conditional independence graphs . . . . . . . . . . . . . . . . . . . . . . . 234.2 Gaussian graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2.1 Neighbourhood selection . . . . . . . . . . . . . . . . . . . . . . . . 244.2.2 Schur complements and the precision matrix . . . . . . . . . . . . . 254.2.3 The Graphical Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5 High-dimensional inference 26

5.1 Family-wise error rate control . . . . . . . . . . . . . . . . . . . . . . . . . 275.2 The False Discovery Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 295.3 Hypothesis testing in high-dimensional regression . . . . . . . . . . . . . . 30

1 Introduction

1.1 High-dimensional data

Over the last 25 years, the sorts of datasets that statisticians have been challenged tostudy have changed greatly. Where in the past, we were used to datasets with manyobservations with a few carefully chosen variables, we are now seeing datasets where thenumber of variables run into the thousands and even exceed the number of observations.For example, the field of genomics routinely encounters datasets where the number ofvariables or predictors can be in the tens of thousands or more. Classical statistical methodsare often simply not applicable in these “high-dimensional” situations. Designing methodsthat can cope with these challenging settings has been and continues to be one of the mostactive areas of research in statistics. Before we dive into the details of these methods, itwill be helpful to review some results from classical statistical theory to set the scene forthe more modern methods to follow.

1.2 Classical statistics

Consider response–covariate pairs (Yi,xi) 2 R ⇥ Rp, i = 1, . . . , n. A linear model for thedata assumes that it is generated according to

Y = X�0 + ", (1.1)

where Y 2 Rn is the vector of responses; X 2 Rn⇥p is the predictor matrix (or designmatrix) with ith row x

T

i; " 2 Rn represents random error; and �0 2 Rp is the unknown

vector of coe�cients.When p ⌧ n, a sensible way to estimate �0 is by ordinary least squares (OLS). This

yields an estimator �OLS

with

�OLS

:= argmin�2Rp

kY �X�k22= (XT

X)�1X

TY, (1.2)

2

Page 3: APTS High-dimensional Statistics - Warwick

provided X has full column rank.

Under the assumptions that E(") = 0 and Var(") = �2I, �

OLS

is unbiased and hasvariance �2(XT

X)�1. How small is this variance? The Gauss–Markov theorem states thatOLS is the best linear unbiased estimator in this setting: for any other estimator � thatis linear in Y (so � = AY for some fixed matrix A), we have

Var�0,�2(�)� Var�0

,�2(�OLS

)

is positive semi-definite.

1.3 Maximum likelihood estimation

The method of least squares is just one way to construct as estimator. A more generaltechnique is that of maximum likelihood estimation. Here given data y 2 Rn that we takeas a realisation of a random variable Y, we specify its density f(y;✓) up to some unknownvector of parameters ✓ 2 ⇥ ✓ Rd, where ⇥ is the parameter space. The likelihood functionis a function of ✓ for each fixed y given by

L(✓) := L(✓;y) = c(y)f(y;✓),

where c(y) is an arbitrary constant of proportionality. The maximum likelihood estimateof ✓ maximises the likelihood, or equivalently it maximises the log-likelihood

`(✓) := `(✓;y) = log f(y;✓) + log(c(y)).

A very useful quantity in the context of maximum likelihood estimation is the Fisher

information matrix with jkth (1 j, k d) entry

ijk(✓) := �E✓

⇢@2

@✓j@✓k`(✓)

�.

It can be thought of as a measure of how hard it is to estimate ✓ when it is the trueparameter value. The Cramer–Rao lower bound states that if ✓ is an unbiased estimatorof ✓, then under regularity conditions,

Var✓(✓)� i�1(✓)

is positive semi-definite.A remarkable fact about maximum likelihood estimators (MLEs) is that (under quite

general conditions) they are asymptotically normally distributed, asymptotically unbiasedand asymptotically achieve the Cramer–Rao lower bound.

Assume that the Fisher information matrix when there are n independent observations,i(n)(✓) (where we have made the dependence on n explicit) satisfies i(n)(✓)/n ! I(✓) forsome positive definite matrix I. Then denoting the maximum likelihood estimator of

3

Page 4: APTS High-dimensional Statistics - Warwick

✓ when there are n observations by ✓(n)

, under regularity conditions, as the number ofobservations n ! 1 we have

pn(✓

(n)

� ✓)d! Nd(0, I

�1(✓)).

Returning to our linear model, if we assume in addition that " ⇠ Nn(0, �2I), then the

log-likelihood for (�, �2) is

`(�, �2) = �n

2log(�2)� 1

2�2

nX

i=1

(Yi � xT

i�)2.

We see that the maximum likelihood estimate of � and OLS coincide. You can check that

i(�, �2) =

✓��2

XTX 0

0 n��4/2

◆.

The general theory for MLEs would suggest that approximately

pn(� � �0) ⇠ Np(0, n�

2(XTX)�1);

in fact it is straightforward to show that this distributional result is exact.

1.4 Shortcomings of classical statistics

We have seen that the classical statistical methods of OLS and maximum likelihood es-timation enjoy important optimality properties and come ready with a framework forperforming inference. However, these methods do have their limitations.

• A crucial part of the optimality arguments for OLS and MLEs was unbiasedness. Dothere exist biased methods whose variance is is reduced compared to OLS such thattheir overall prediction error is lower? Yes!—in fact the use of biased estimators isessential in dealing with settings where the number of parameters to be estimated islarge compared to the number of observations.

• The asymptotic statements concerning MLEs may not be relevant in many practicalsettings. These statements concern what happens as n ! 1 whilst p is kept fixed.In practice we often find that datasets with a large n also have a large p. Thus amore relevant asymptotic regime may be to consider p increasing with n at somerate; in these settings the optimality properties of MLEs will typically fail to hold.

1.5 Notation

Given A,B ✓ {1, . . . , p}, and x 2 Rp, we will write xA for the sub-vector of x formed fromthose components of x indexed by A. Similarly, we will write MA for the submatrix ofM formed from those columns of M indexed by A. Further, MA,B will be the submatrix

4

Page 5: APTS High-dimensional Statistics - Warwick

of M formed from columns and rows indexed by A and B respectively. For example,x{1,2} = (x1, x2)T , M{1,2} is the matrix formed from the first two columns of M, andM{1,2},{1,2} is the top left 2⇥ 2 submatrix of M.

In addition, when used in subscripts, we will use �j and �jk to denote {1, . . . , p} \{j} := {j}c and {1, . . . , p} \ {j, k} := {j, k}c respectively. So for example, M�jk is thesubmatrix of M that has columns j and k removed.

2 Ridge regression and kernel machines

Let us revisit the linear modelYi = x

T

i�0 + "i

where E("i) = 0 and Var(") = �2I. For unbiased estimators of �0, their variance gives

a way of comparing their quality in terms of squared error loss. For a potentially biasedestimator, �, the relevant quantity is

E�0,�2{(� � �0)(� � �0)T} = E[{� � E(�) + E(�)� �0}{� � E(�) + E(�)� �0}T ]

= Var(�) + {E(� � �0)}{E(� � �0)}T ,

a sum of squared bias and variance terms. In this section and the next, we will exploretwo important methods for variance reduction based on di↵erent forms of penalisation:rather than forming estimators via optimising a least squares or log-likelihood term, wewill introduce an additional penalty term that encourages estimates to be shrunk towards0 in some sense. This will allow us to produce reliable estimators that work well whenclassical MLEs are infeasible, and in other situations can greatly out-perform the classicalapproaches.

Ridge regression [Hoerl and Kennard, 1970] performs shrinkage by solving the followingoptimisation problem

(µR

�, �

R

�) = argmin

(µ,�)2R⇥Rp

{kY � µ1�X�k22+ �k�k2

2}.

Here 1 is an n-vector of 1’s. We see that the usual OLS objective is penalised by anadditional term proportional to k�k2

2. The parameter � � 0, which controls the severity of

the penalty and therefore the degree of the shrinkage towards 0, is known as a regularisationparameter or tuning parameter. Note we have explicitly included an intercept term which isnot penalised. The reason for this is that were the variables to have their origins shifted soe.g. a variable representing temperature is given in units of Kelvin rather than Celsius, thefitted values would not change. However, X� is not invariant under scale transformationsof the variables so it is standard practice to centre each column of X (hence making themorthogonal to the intercept term) and then scale them to have `2-norm

pn.

The following lemma shows that after this standardisation of X, µR

�= Y :=

Pn

i=1Yi/n,

so we may assume thatP

n

i=1Yi = 0 by replacing Yi by Yi � Y and then we can remove µ

from our objective function.

5

Page 6: APTS High-dimensional Statistics - Warwick

Lemma 1. Suppose the columns of X have been centred. If a minimiser (µ, �) of

kY � µ1�X�k22+ J(�).

over (µ,�) 2 R⇥ Rpexists, then µ = Y.

Once these modifications have been made, di↵erentiating the resulting objective withrespect to � yields

�R

�= (XT

X+ �I)�1X

TY.

In this form, we can see how the addition of the �I term helps to stabilise the estimator.Note that when X does not have full column rank (such as in high-dimensional situations),

�R

�can still be computed. On the other hand, when X does have full column rank and

assuming a linear model, we have the following theorem.

Theorem 2. For � su�ciently small (depending on �0and �2

),

E(�OLS

� �0)(�OLS

� �0)T � E(�R

�� �0)(�

R

�� �0)T

is positive definite.

At first sight this might appear to contradict the Gauss–Markov theorem. However,though ridge regression is a linear estimator, it is not unbiased. In order to better under-stand the performance of ridge regression, we will turn to one of the key matrix decompo-sitions used in statistics.

2.1 The SVD and PCA

Recall that we can factorise any X 2 Rn⇥p into its SVD

X = UDVT .

Here the U 2 Rn⇥n and V 2 Rp⇥p are orthogonal matrices and D 2 Rn⇥p has D11 � D22 �· · · � Dmm � 0 where m := min(n, p) and all other entries of D are zero.

When n > p, we can replaceU by its first p columns andD by its first p rows to produceanother version of the SVD (sometimes known as the thin SVD). Then X = UDV

T whereU 2 Rn⇥p has orthonormal columns (but is no longer square) andD is square and diagonal.There is an equivalent version for when p > n.

Let us take X 2 Rn⇥p as our matrix of predictors and suppose n � p. Using the (thin)SVD we may write the fitted values from ridge regression as follows.

X�R

�= X(XT

X+ �I)�1X

TY

= UDVT (VD

2V

T + �I)�1VDU

TY

= UD(D2 + �I)�1DU

TY

=pX

j=1

Uj

D2

jj

D2

jj+ �

UT

jY.

6

Page 7: APTS High-dimensional Statistics - Warwick

Here we have used the notation (which we shall use throughout the course) that Uj is thejth column of U. For comparison, the fitted values from OLS (when X has full columnrank) are

X�OLS

= X(XTX)�1

XTY = UU

TY.

Both OLS and ridge regression compute the coordinates of Y with respect to the columnsof U. Ridge regression then shrinks these coordinates by the factors D2

jj/(D2

jj+ �); if Djj

is small, the amount of shrinkage will be larger.To interpret this further, note that the SVD is intimately connected with Principal

Component Analysis (PCA). Consider v 2 Rp with kvk2 = 1. Since the columns of Xhave had their means subtracted, the sample variance of Xv 2 Rn, is

1

nvTX

TXv =

1

nvTVD

2V

Tv.

Writing a = VTv, so kak2 = 1, we have

1

nvTVD

2V

Tv =

1

naTD

2a =

1

n

X

j

a2jD2

jj 1

nD11

X

j

a2j=

1

nD2

11.

As kXV1k22/n = D2

11/n, V1 determines the linear combination of the columns of X that

has the largest sample variance, when the coe�cients of the linear combination are con-strained to have `2-norm 1. XV1 = D11U1 is known as the first principal component ofX. Subsequent principal components D22U2, . . . , DppUp have maximum variance D2

jj/n,

subject to being orthogonal to all earlier ones.Returning to ridge regression, we see that it shrinks Y most in the smaller principal

components of X. Thus it will work well when most of the signal is in the large principalcomponents of X. We now turn to the problem of choosing �.

2.2 Cross-validation

Cross-validation is a general technique for selecting a good regression method from amongseveral competing regression methods. We illustrate the principle with ridge regression,where we have a family of regression methods given by di↵erent � values.

So far, we have considered the matrix of predictors X as fixed and non-random. How-ever, in many cases, it makes sense to think of it as random. Let us assume that our dataare i.i.d. pairs (xi,Yi), i = 1, . . . , n. Then ideally, we might want to pick a � value suchthat

E{(Y ⇤ � x⇤T �

R

�(X,Y))2|X,Y} (2.1)

is minimised. Here (x⇤, Y ⇤) 2 Rp⇥R is independent of (X,Y) and has the same distribution

as (x1,Y1), and we have made the dependence of �R

�on the training data (X,Y) explicit.

This � is such that conditional on the original training data, it minimises the expectedprediction error on a new observation drawn from the same distribution as the trainingdata.

7

Page 8: APTS High-dimensional Statistics - Warwick

A less ambitious goal is to find a � value to minimise the expected prediction error,

E[E{(Y ⇤ � x⇤T �

R

�(X,Y))2|X,Y}] (2.2)

where compared with (2.1), we have taken a further expectation over the training set.We still have no way of computing (2.2) directly, but we can attempt to estimate it.

The idea of v-fold cross-validation is to split the data into v groups or folds of roughly equalsize Let (X(�k),Y(�k)) be all the data except that in the kth fold, and let Ak ⇢ {1, . . . , n}be the observation indices corresponding to the kth fold. For each � on a grid of values,

we compute �R

�(X(�k),Y(�k)): the ridge regression estimate based on all the data except

the kth fold. We choose the value of � that minimises

CV(�) =1

n

vX

k=1

X

i2Ak

{Yi � xT

i�

R

�(X(�k),Y(�k))}2. (2.3)

Writing �CV for the minimiser, our final estimate of �0 can then be �R

�CV(X,Y).

Note that for each i 2 Ak,

E{Yi � xT

i�

R

�(X(�k),Y(�k))}2 = E[E{Yi � x

T

i�

R

�(X(�k),Y(�k))}2|X(�k),Y(�k)]. (2.4)

This is precisely the expected prediction error in (2.2) but with the training data X,Yreplaced with a training data set of smaller size. If all the folds have the same size, thenCV(�) is an average of n identically distributed quantities, each with expected value as in(2.4). However, the quantities being averaged are not independent as they share the samedata.

Thus cross-validation gives a biased estimate of the expected prediction error. Theamount of the bias depends on the size of the folds, the case when the v = n giving theleast bias—this is known as leave-one-out cross-validation. The quality of the estimate,though, may be worse as the quantities being averaged in (2.3) will be highly positivelycorrelated. Typical choices of v are 5 or 10.

Cross-validation aims to allow us to choose the single best � (or more generally regres-sion procedure); we could instead aim to find the best weighted combination of regressionprocedures. Returning to our ridge regression example, suppose � is restricted to a grid ofvalues �1 > �2 > · · · > �L. We can then minimise

1

n

vX

k=1

X

i2Ak

⇢Yi �

LX

l=1

wlxT

i�

R

�l(X(�k),Y(�k))

�2

over w 2 RL subject to wl � 0 for all l. This is a non-negative least-squares optimisation,for which e�cient algorithms are available. This sort of idea is known as stacking [Wolpert,1992, Breiman, 1996] and it can often outperform cross-validation.

8

Page 9: APTS High-dimensional Statistics - Warwick

2.3 The kernel trick

The fitted values from ridge regression are

X(XTX+ �I)�1

XTY. (2.5)

An alternative way of writing this is suggested by the following

XT (XX

T + �I) = (XTX+ �I)XT

(XTX+ �I)�1

XT = X

T (XXT + �I)�1

X(XTX+ �I)�1

XTY = XX

T (XXT + �I)�1

Y. (2.6)

Two remarks are in order:

• Note while XTX is p ⇥ p, XX

T is n ⇥ n. Computing fitted values using (2.5)would require roughly O(np2 + p3) operations. If p � n this could be extremelycostly. However, our alternative formulation would only require roughly O(n2p+n3)operations, which could be substantially smaller.

• We see that the fitted values of ridge regression depend only on inner productsK = XX

T between observations (note Kij = xT

ixj).

Now suppose that we believe the signal depends quadratically on the predictors:

Yi = xT

i� +

X

k,l

xikxil✓kl + "i.

We can still use ridge regression provided we work with an enlarged set of predictors

xi1, . . . , xip, xi1xi1, . . . , xi1xip, xi2xi1, . . . , xi2xip, . . . , xipxip. (2.7)

This will give us O(p2) predictors. Our new approach to computing fitted values wouldtherefore have complexity O(n2p2 + n3), which could still be rather costly if p is large.

However, rather than first creating all the additional predictors and then computingthe new K matrix, we can attempt to directly compute K. To this end consider

✓1

2+ x

T

ixj

◆2

� 1

4=

✓1

2+X

k

xikxjk

◆2

� 1

4(2.8)

=X

k

xikxjk +X

k,l

xikxilxjkxjl.

Observe this amounts to an inner product between vectors of the form (2.7). Thus if weset

Kij = (1/2 + xT

ixj)

2 � 1/4

and plug this into the formula for the fitted values, it is exactly as if we had performedridge regression on an enlarged set of variables given by (2.7). Now computing K using

9

Page 10: APTS High-dimensional Statistics - Warwick

(2.8) would require only p operations per entry, so O(n2p) operations in total. It thusseems we have improved things by a factor of p using our new approach. This is a nicecomputational trick, but more importantly for us it serves to illustrate some general points.Since ridge regression only depends on inner products between observations, rather thanfitting non-linear models by first mapping the original data xi 2 Rp to �(xi) 2 Rd (say)using some feature map � (which could, for example introduce quadratic e↵ects), we caninstead try to directly compute k(xi,xj) = h�(xi),�(xj)i.

In fact, rather than thinking in terms of feature maps, we can instead try to thinkabout an appropriate measure of similarity k(xi,xj) between observations. It turns outthat a necessary and su�cient property for such a similarity measure to have in order for itto correspond to an inner product of the form k(xi,xj) = h�(xi),�(xj)i may be formalisedin the following definition.

Definition 1. Given a non-empty set X (the input space), a positive definite kernel k :X ⇥X ! R is a symmetric map with the property that given any collection of m potentialobservation vectors x1, . . . ,xm 2 X , the matrix K 2 Rm⇥m with entries

Kij = k(xi,xj)

is positive semi-definite.

Modelling by directly picking a kernel with this positive-definiteness property is some-times easier and more natural than thinking in terms of feature maps. Note that the inputspace X need not be Rd and so working with kernels gives us a way of performing regressionon complex non-Euclidean data. Below are some examples of popular kernels.

Polynomial kernel. k(x,x0) = (a+ xTx0)r.

Gaussian kernel.

k(x, x0) = exp

✓� kx� x

0k22

2�2

◆.

For x close to x0 it is large whilst for x far from x

0 the kernel quickly decays towards 0.The additional parameter �2 known as the bandwidth controls the speed of the decay tozero. Note it is less clear how one might find a corresponding feature map and indeed anyfeature map that represents this must be infinite dimensional.

Jaccard similarity kernel. Take X to be the set of all subsets of {1, . . . , p}. Forx, x0 2 X with x [ x0 6= ; define

k(x, x0) =|x \ x0||x [ x0|

and if x [ x0 = ; then set k(x, x0) = 1.

10

Page 11: APTS High-dimensional Statistics - Warwick

2.4 The representer theorem

Ridge regression is just one of many procedures that depends only on inner productsbetween observation vectors. Consider mapping our original data xi 2 Rp to �(xi) 2 Rd

(e.g. to include quadratic terms) and let � 2 Rn⇥d be the matrix with ith row �(xi). Letpositive definite kernel k be given by k(x,x0) = h�(x),�(x0)i. We now explain why anymethod involving optimising an objective of the form

c(Y,��) + �k�k22

(2.9)

over � 2 Rd only depends on k provided it is predictions at new or existing data pointsthat are of interest rather than the minimiser � itself. Let P 2 Rd⇥d be the orthogonalprojection on to the row space of � and note that �� = �P�. Meanwhile

k�k22= kP�k2

2+ k(I�P)�k2

2.

We conclude that any minimiser � must satisfy � = P�, that is � must be in the rowspace of �. This means that we may write � = �

T ↵ for some ↵ 2 Rn. Let K 2 Rn⇥n

be the matrix with ijth entry Kij = k(xi,xj), so K = ��T . Substituting � = �

T↵ into(2.9), we see that ↵ minimises

c(Y,K↵) + �↵TK↵ (2.10)

over ↵ 2 Rn. Note that the predicted value at a new data point x is

�(x)T � = �(x)T�T ↵ =nX

i=1

k(x,xi)↵i.

What is remarkable is that whilst the optimisation in (2.9) involves d variables, we haveshown this is equivalent to (2.10) which involves n variables: this is a substantial simplifi-cation if d � n. In fact, this result, which is known as the representer theorem [Kimeldorfand Wahba, 1970, Scholkopf et al., 2001], can be generalised to the case where the innerproduct space that � maps is infinite-dimensional.

A key application of the representer theorem is to the famous support vector classifier

which is commonly used in the classification setting when the response is binary so Y 2{�1, 1}n. The optimisation problem can be expressed as

nX

i=1

(1� YixT

i�)+ + �k�k2

2,

which we see is of the form (2.9). Note here (x)+ = x {x>0}.This section has been the briefest of introductions to the world of kernel machines

which form an important class of machine learning methods. If you are interested inlearning more, you can try the survey paper Hofmann et al. [2008], or Scholkopf andSmola [2001] for a more in depth treatment.

11

Page 12: APTS High-dimensional Statistics - Warwick

3 The Lasso

Let us revisit the linear model Y = X�0 + " where E(") = 0, Var(") = �2I. In many

modern datasets, there are reasons to believe there are many more variables present thanare necessary to explain the response. The MSPE of OLS is

1

nEkX�0 �X�

OLS

k22=

1

nE{(�0 � �

OLS

)TXTX(�0 � �

OLS

)}

=1

nE[tr{(�0 � �

OLS

)(�0 � �OLS

)TXTX}]

=1

ntr[E{(�0 � �

OLS

)(�0 � �OLS

)T}XTX]

=1

ntr(Var(�

OLS

)XTX) =

p

n�2.

Let S be the set S = {k : �0

k6= 0} and suppose s := |S| ⌧ p. If we could identify S and

then fit a linear model using just these variables, we’d obtain an MSPE of �2s/n whichcould be substantially smaller than �2p/n. Furthermore, it can be shown that parameterestimates from the reduced model are more accurate. The smaller model would also beeasier to interpret.

We now briefly review some classical model selection strategies.

Best subset regression. A natural approach to finding S is to consider all 2p pos-sible regression procedures each involving regressing the response on a di↵erent sets ofexplanatory variables XM where M is a subset of {1, . . . , p}. We can then pick the bestregression procedure using cross-validation (say). For general design matrices, this involvesan exhaustive search over all subsets, so this is not really feasible for p > 50.

Forward selection. This can be seen as a greedy way of performing best subsets regres-sion. Given a target model size m (the tuning parameter), this works as follows.

1. Start by fitting an intercept only model.

2. Add to the current model the predictor variable that reduces the residual sum ofsquares the most.

3. Continue step 2 until m predictor variables have been selected.

3.1 The Lasso estimator

The Least absolute shrinkage and selection operator (Lasso) [Tibshirani, 1996] estimates

�0 by �L

�, where (µL, �

L

�) minimise

1

2nkY � µ1�X�k2

2+ �k�k1 (3.1)

12

Page 13: APTS High-dimensional Statistics - Warwick

over (µ,�) 2 R⇥ Rp. Here k�k1 is the `1-norm of �: k�k1 =P

p

k=1|�k|.

Like ridge regression, �L

�shrinks the OLS estimate towards the origin, but there is

an important di↵erence. The `1 penalty can force some of the estimated coe�cients to beexactly 0. In this way the Lasso can perform simultaneous variable selection and parameterestimation. As we did with ridge regression, we can centre and scale the X matrix, andalso centre Y and thus remove µ from the objective (see Lemma 1). Define

Q�(�) =1

2nkY �X�k2

2+ �k�k1. (3.2)

Now any minimiser of Q�(�) will also be a minimiser of

kY �X�k22subject to k�k1 k�

L

�k1.

Similarly, in the case of the ridge regression objective, we know that �R

�minimises kY �

X�k22subject to k�k2 k�

R

�k2.

The contours of the OLS objective kY�X�k22are ellipsoids centred at �

OLS

, while thecontours of k�k2

2are spheres centred at the origin, and the contours of k�k1 are ‘diamonds’

centred at 0.The important point to note is that the `1 ball {� 2 Rp : k�k1 k�

L

�k1} has corners

where some of the components are zero, and it is likely that the OLS contours will intersectthe `1 ball at such a corner.

3.2 Prediction error

A remarkable property of the Lasso is that even when p � n, it can still perform well interms of prediction error. Suppose the columns of X have been centred and scaled (as wewill always assume from now on unless stated otherwise) and a linear model holds:

Y = X�0 + "� "1. (3.3)

Note we have already centred Y. We will further assume that the "i are independentmean-zero sub-Gaussian random variables with common parameter �. Note this includesas a special case " ⇠ Nn(0, �2

I).

13

Page 14: APTS High-dimensional Statistics - Warwick

Theorem 3. Let � be the Lasso solution when

� = A�

rlog(p)

n.

With probability at least 1� 2p�(A2/2�1)

1

nkX(�0 � �)k2

2 4A�

rlog(p)

nk�0k1.

Proof. From the definition of � we have

1

2nkY �X�k2

2+ �k�k1

1

2nkY �X�0k2

2+ �k�0k1.

Rearranging,

1

2nkX(�0 � �)k2

2 1

n"TX(� � �0) + �k�0k1 � �k�k1.

Now |"TX(� � �0)| kXT"k1k� � �0k1 by Holder’s inequality. Let ⌦ = {kXT"k1/n �}. Lemma 6 below shows that P(⌦) � 1�2p�(A

2/2�1). Working on the event ⌦, we obtain

1

2nkX(�0 � �)k2

2 �k�0 � �k1 + �k�0k1 � �k�k1,

1

nkX(�0 � �)k2

2 4�k�0k1, by the triangle inequality.

3.2.1 The event ⌦

The proof of Theorem 3 relies on a lower bound for the probability of the event ⌦. A unionbound gives

P(kXT"k1/n > �) = P([p

j=1|XT

j"|/n > �)

pX

j=1

P(|XT

j"|/n > �).

Thus if we can obtain tail probability bounds on |XT

j"|/n, the argument above will give a

lower bound for P(⌦).Recall a random variable is said to be sub-Gaussian with parameter � if

Ee↵(W�EW ) e↵2�2/2

for all ↵ 2 R (see Section 6 of the preliminary material). This property implies the tailbound

P(W � EW � t) e�t2/(2�

2).

If W ⇠ N (0, �2), then W is sub-Gaussian with parameter �. The sub-Gaussian class alsoincludes bounded random variables:

14

Page 15: APTS High-dimensional Statistics - Warwick

Lemma 4 (Hoe↵ding’s lemma). If W is mean-zero and takes values in [a, b], then W is

sub-Gaussian with parameter (b� a)/2.

The following proposition shows that analogously to how a linear combination of jointlyGaussian random variables is Gaussian, a linear combination of sub-Gaussian randomvariables is also sub-Gaussian.

Proposition 5. Let (Wi)ni=1be a sequence of independent mean-zero sub-Gaussian ran-

dom variables with parameters (�i)ni=1and let � 2 Rn

. Then �TW is sub-Gaussian with

parameter

⇣Pi�2

i�2

i

⌘1/2

.

Proof.

E exp⇣↵

nX

i=1

�iWi

⌘=

nY

i=1

E exp(↵�iWi)

nY

i=1

exp(↵2�2

i�2

i/2)

= exp⇣↵2

nX

i=1

�2

i�2

i/2⌘.

We are now in a position to obtain the probability bound required for Theorem 3.

Lemma 6. Suppose ("i)ni=1are independent, mean-zero and sub-Gaussian with common

parameter �. Let � = A�plog(p)/n. Then

P(kXT"k1/n �) � 1� 2p�(A2/2�1).

Proof.

P(kXT"k1/n > �) pX

j=1

P(|XT

j"|/n > �).

But ±XT

j"/n are both sub-Gaussian with parameter (�2kXjk22/n2)1/2 = �/

pn. Thus the

RHS is at most2p exp(�A2 log(p)/2) = 2p1�A

2/2.

3.3 Estimation error

Consider once more the model Y = X�0 + " � "1 where the components of " are inde-pendent mean-zero sub-Gaussian random variables with common parameter �. Let S ands = |S| be as in the previous section and define N = Sc = {1, . . . , p} \ S. As we havenoted before, in an artificial situation where S is known, we could apply OLS on XS andhave an MSPE of �2s/n. Under a so-called compatibility condition on the design matrix,we can obtain a similar MSPE for the Lasso.

15

Page 16: APTS High-dimensional Statistics - Warwick

Definition 2. Given a matrix of predictors X 2 Rn⇥p and support set S, define

�2 = inf�2Rp:�S 6=0, k�Nk13k�Sk1

1

nkX�k2

2

1

sk�

Sk21

= inf�2Rp:k�Sk1=1, k�Nk13

s

nkXS�S

�XN�Nk22,

and we take � � 0. The compatibility condition is that �2 > 0.

Note that if XTX/n has minimum eigenvalue cmin > 0 (so necessarily p n), then

�2 > cmin. Indeed by the Cauchy–Schwarz inequality,

k�Sk1 = sgn(�

S)T�

S

psk�

Sk2

psk�k2.

Thus

�2 � inf� 6=0

1

nkX�k2

2

k�k22

= cmin.

Although in the high-dimensional setting we would have cmin = 0, the fact that the infimumin the definition of �2 is over a restricted set of � can still allow �2 to be positive even inthis case. For example, suppose the rows of X were drawn independently from a mean-zerosun-Gaussian distribution whose variance ⌃ has minimum eigenvalue greater than someconstant c > 0. In this case it can be shown that P(�2 > c/2) ! 1 if s

plog(p)/n ! 0 (in

fact stronger results are true for a Gaussian design [Raskutti et al., 2010]).

Theorem 7. Suppose the compatibility condition holds and let � be the Lasso solution with

� = A�plog(p)/n for A > 0. Then with probability at least 1� 2p�(A

2/8�1)

, we have

1

nkX(�0 � �)k2

2+ �k� � �0k1

12�2s

�2=

12A2 log(p)

�2

�2s

n.

Proof. As in Theorem 3 we start with the “basic inequality”:

1

2nkX(� � �0)k2

2+ �k�k1

1

n"TX(� � �0) + �k�0k1.

We work on the event ⌦ = {2kXT"k1/n �} where after applying Holder’s inequality,we get

1

nkX(� � �0)k2

2+ 2�k�k1 �k� � �0k1 + 2�k�0k1. (3.4)

Lemma 6 shows that P(⌦) � 1� 2p�(A2/8�1).

To motivate the rest of the proof, consider the following idea. We know

1

nkX(� � �0)k2

2 3�k� � �0k1.

If we could get

3�k� � �0k1 c�pnkX(� � �0)k2

16

Page 17: APTS High-dimensional Statistics - Warwick

for some constant c > 0, then we would have that kX(� � �0)k22/n c2�2 and also

3�k�0 � �k1 c2�2.Returning to the actual proof, write a = kX(� � �0)k2

2/(n�). Then from (3.4) we can

derive the following string of inequalities:

a+ 2(k�Nk1 + k�

Sk1) k�

S� �0

Sk1 + k�

Nk1 + 2k�0

Sk1

a+ k�Nk1 k�

S� �0

Sk1 + 2k�0

Sk1 � 2k�

Sk1

a+ k�N� �0

Nk1 3k�0

S� �

Sk1.

Now, using the compatibility condition with � = � � �0 we have

1

nkX(� � �0)k2

2+ �k�0

N� �

Nk1 3�k�0

S� �

Sk1

3�

rs

nkX(� � �0)k2.

From this we get1pnkX(� � �0)k2

3�ps

�,

which substituting into the RHS of the above gives

1

nkX(� � �0)k2

2+ �k�0

N� �

Nk1 9�2s/�2.

Adding � times the inequality

k�0

S� �

Sk1 3�s/�2 (3.5)

to both sides then gives the result.

Variable screening. The estimation error results shows in particular that if componentsof �0 are su�ciently far away from 0, the corresponding Lasso coe�cients will also be non-zero (and have the same sign). Thus the set of non-zeroes of the Lasso S� should containthe set of important predictors, though we may miss some variables in S that have smallcoe�cients.

Corollary 8. Consider the setup of Theorem 7. Let Simp = {j : |�0

j| > 3�sA

plog(p)/n/�2}.

With probability 1� 2p�(A2/8�1)

S� � Simp.

Proof. From (3.5) we know that on ⌦ we have (in particular) for j 2 Simp

|�0

j� �j| 3�sA

plog(p)/n/�2

which implies sgn(�j) = sgn(�0

j).

17

Page 18: APTS High-dimensional Statistics - Warwick

The conditions required for S� = S are somewhat stronger [Meinshausen and Buhlmann,2006, Zhao and Yu, 2006, Wainwright, 2009].

Note that the choice of � given in the results above involves � which is typically un-known. In practice � is usually selected by cross-validation (or some form of stacking maybe used). However Belloni et al. [2011], Sun and Zhang [2012] show that by modifying theLasso objective to remove the square on the least squares term kY�X�k2

2(known as the

square-root Lasso), there is a universal choice of � not depending on � for which the aboveresults hold.

3.4 Subgradients and the KKT conditions

E�cient computation of Lasso solutions makes use of a set of necessary and su�cientconditions for a vector to minimise the Lasso objective. These conditions are e↵ectivelyzero gradient conditions, but take account of the fact that the Lasso objective is notdi↵erentiable at any point where any �j = 0. In order to derive these conditions, wemust introduce the notion of a subgradeient, which generalises the gradient to potentiallynon-di↵erentiable but convex functions.

Definition 3. A vector v 2 Rd is a subgradient of a convex function f : Rd ! R at x if

f(y) � f(x) + vT (y � x) for all y 2 Rd.

The set of subgradients of f at x is called the subdi↵erential of f at x and denoted @f(x).

In order to make use of subgradients, we will require the following two facts:

Proposition 9. Let f : Rd ! R be convex, and suppose f is di↵erentiable at x. Then

@f(x) = {@f/@x}.

Proposition 10. Let f, g : Rd ! R be convex and let ↵ > 0. Then

@(↵f)(x) = ↵@f(x) = {↵v : v 2 @f(x)},@(f + g)(x) = @f(x) + @g(x) = {v +w : v 2 @f(x), w 2 @g(x)}.

The following easy (but key) result is often referred to in the statistical literature as theKarush–Kuhn–Tucker (KKT) conditions, though it is actually a much simplified versionof them.

Proposition 11. x⇤ 2 argmin

x2Rd

f(x) if and only if 0 2 @f(x⇤).

Proof.

f(y) � f(x⇤) for all y 2 Rd , f(y) � f(x⇤) + 0T (y � x) for all y 2 Rd

, 0 2 @f(x⇤).

18

Page 19: APTS High-dimensional Statistics - Warwick

Let us now compute the subdi↵erential of the `1-norm. First note that k · k1 : Rd ! Ris convex. Indeed it is a norm so the triangle inequality gives ktx + (1� t)yk1 tkxk1 +(1� t)kyk1.

Proposition 12. For x 2 Rdlet A = {j : xj 6= 0}. Then

@kxk1 = {v 2 Rd : kvk1 1 and vA = sgn(xA)}

Proof. For j = 1, . . . , d, let

gj : Rd ! Rx 7! |xj|.

Then k · k =P

jgj(·) so by Proposition 10, @kxk1 =

Pj@gj(x). When x has xj 6= 0, gj

is di↵erentiable at x so by Proposition 9 @gj(x) = {sgn(xj)ej} where ej is the jth unitvector. When xj = 0, if v 2 @gj(x) we must have

gj(y) � gj(x) + vT (y � x) for all y 2 Rd,

so|yj| � v

T (y � x) for all y 2 Rd. (3.6)

we claim that the above holds if and only if vj 2 [�1, 1] and v�j = 0. For the ‘if’ direction,note that vT (y � x) = vjyj |yj|. Conversely, set y�j = x�j + v�j and yj = 0 in (3.6) toget 0 � kv�jk22, so v�j = 0. Then take y with y�j = x�j to get |yj| � vjyj for all yj 2 R,so |vj| 1. Forming the set sum of the subdi↵erentials then gives the result.

Equipped with these tools from convex analysis, we can now fully characterise the

solutions to the Lasso. We have that �L

�is a Lasso solution if and only if 0 2 @Q�(�

L

�),

which is equivalent to1

nX

T (Y �X�L

�) = �⌫,

for ⌫ with k⌫k1 1 and writing S� = {k : �L

�,k6= 0}, ⌫

S�= sgn(�

L

�,S�).

3.5 Computation

One of the most e�cient ways of computing Lasso solutions is to use a optimisation tech-nique called coordinate descent. This is a quite general way of minimising a functionf : Rd ! R and works particularly well for functions of the form

f(x) = g(x) +dX

j=1

hj(xj)

19

Page 20: APTS High-dimensional Statistics - Warwick

where g is convex and di↵erentiable and each hj : R ! R is convex (and so continuous). Westart with an initial guess of the minimiser x(0) (e.g. x(0) = 0) and repeat for m = 1, 2, . . .

x(m)

1= argmin

x12Rf(x1, x

(m�1)

2, . . . , x(m�1)

d)

x(m)

2= argmin

x22Rf(x(m)

1, x2, x

(m�1)

3, . . . , x(m�1)

d)

...

x(m)

d= argmin

xd2Rf(x(m)

1, x(m)

2, . . . , x(m)

d�1, xd).

Tseng [2001] proves that provided A0 = {x : f(x) f(x(0))} is closed and bounded, thenevery converging subsequence of x(m) will converge to a minimiser of f (which must existsince a continuous function on a closed bounded set attains its bounds). In particular thismeans that f(x(m)) ! f(x⇤) where x

⇤ is a minimiser of f . Moreover if x⇤ is the uniqueminimiser of f then x

(m) ! x⇤.

Coordinate descent is particularly useful in the case of the Lasso because the coordinate-wise updates have closed-form expressions. Indeed

argmin�j2R

⇢1

2nkY �X�jb� �jXjk22 + �|�j|

�= T�(X

T

j(Y �X�jb)/n)

where T�(x) = sgn(x)(|x|��)+ is the so-called soft-thresholding operation. This result caneasily be verified by checking that the given minimiser does satisfy the KKT conditions ofthe objective above.

We often want to solve the Lasso on a grid of � values �0 > · · · > �L (for the purposesof cross-validation for example). To do this, we can first solve for �0, and then solve atsubsequent grid points by using the solution at the previous grid points as an initial guess(known as a warm start). An active set strategy can further speed up computation. Thisworks as follows: For l = 1, . . . , L

1. Initialise Al = {k : �L

�l�1,k6= 0}.

2. Perform coordinate descent only on coordinates in Al obtaining a solution � (allcomponents �

kwith k /2 Al are set to zero).

3. Let V = {k : |XT

k(Y �X�)|/n > �l}, the set of coordinates which violate the KKT

conditions when � is taken as a candidate solution.

4. If V is empty, we set �L

�l= �. Else we update Al = Al [ V and return to 2.

Note that �0 may be taken as kXTY/nk1 as for � larger than this, �

L

�= 0 (you can check

that 0 satisfies the KKT conditions when � larger than �0).

20

Page 21: APTS High-dimensional Statistics - Warwick

3.6 Extensions of the Lasso

3.6.1 Generalised linear models

We can add an `1 penalty to many other log-likelihoods besides that arising from thenormal linear model. For `1-penalised generalised linear models, such as logistic regression,similar theoretical results to those we have obtained are available [van de Geer, 2008] andcomputations can proceed in a similar fashion to above [Friedman et al., 2009].

3.6.2 Removing the bias of the Lasso

One potential drawback of the Lasso is that the same shrinkage e↵ect that sets manyestimated coe�cients exactly to zero also shrinks all non-zero estimated coe�cients towards

zero. One possible solution is to take S� = {k : �L

�,k6= 0} and then re-estimate �0

S�by

OLS regression on XS�. Another option is to re-estimate using the Lasso on X

S�; this

procedure is known as the relaxed Lasso [Meinshausen, 2007].

The adaptive Lasso [Zou, 2006] takes an initial estimate of �0, �init

(e.g. from the Lasso)and then performs weighted Lasso regression:

�adapt

�= argmin

�2Rp:�Scinit

=0

⇢1

2nkY �X�k2

2+ �

X

k2Sinit

|�k||�init

k|

�,

where Sinit = {k : �init

k6= 0}.

Yet another approach involves using a family of non-convex penalty functions p�,� :[0,1) ! [0,1) and attempting to minimise

1

2nkY �X�k2

2+

pX

k=1

p�,�(|�k|).

A prominent example is the minimax concave penalty (MCP) [Zhang, 2010] which takes

p0�(u) =

✓�� u

+

.

One disadvantage of using a non-convex penalty is that there may be multiple local minimawhich can make optimisation problematic. However, typically if the non-convexity is nottoo severe, coordinate descent can produce reasonable results.

3.6.3 The elastic net

The elastic net [Zou and Hastie, 2005] uses a weighted combination of ridge and Lassopenalties:

�EN

�,↵= argmin

�2Rp

⇢1

2nkY �X�k2

2+ �{↵k�k1 + (1� ↵)k�k2

2}�.

21

Page 22: APTS High-dimensional Statistics - Warwick

Here ↵ 2 [0, 1] is an additional tuning parameter. The presence of the `2 penalty encouragescoe�cients of variables correlated to other variables with non-zero estimated coe�cientsto also be non-zero.

3.6.4 Group Lasso

The Lasso penalty encourages the estimated coe�cients to be shrunk towards 0 and some-times exactly to 0. Other penalty functions can be constructed to encourage di↵erent typesof sparsity. Suppose we have a partition G1, . . . , Gq of {1, . . . , p} (so [q

k=1Gk = {1, . . . , p},

Gj \Gk = ; for j 6= k). The group Lasso penalty [Yuan and Lin, 2006] is given by

�qX

j=1

mjk�Gjk2.

The multipliers mj > 0 serve to balance cases where the groups are of very di↵erent sizes;typically we choose mj =

p|Gj|. This penalty encourages either an entire group G to have

�G= 0 or �k 6= 0 for all k 2 G. Such a property is useful when groups occur through coding

for categorical predictors or when expanding predictors using basis functions [Ravikumaret al., 2007].

3.6.5 Fused Lasso

If there is a sense in which the coe�cients are ordered, so �0

jis expected to be close to

�0

j+1, a fused Lasso penalty [Tibshirani et al., 2005] may be appropriate. This takes the

form

�1

p�1X

j=1

|�j � �j+1|+ �2k�k1,

where the second term may be omitted depending on whether shrinkage towards 0 isdesired. As an example, consider the simple setting where Yi = µ0

i+ "i, and it is thought

that the (µ0

i)ni=1

form a piecewise constant sequence. Then one option is to minimise overµ 2 Rn, the following objective

1

nkY � µk2

2+ �

n�1X

i=1

|µi � µi+1|.

4 Graphical modelling

So far we have considered the problem of relating a particular response to a large collectionof explanatory variables. In some settings however, we do not have a distinguished responsevariable and instead we would like to better understand relationships between all thevariables. We may, for example, wish to identify variables that are ‘directly related’ toeach other in some sense. Trying to find pairs of variables that are independent and so

22

Page 23: APTS High-dimensional Statistics - Warwick

unlikely to be related to each other is not necessarily a good way to proceed as each variablemay be correlated with a large number of variables without being directly related to them.A potentially better approach is to use conditional independence.

Definition 4. If X, Y and Z are random vectors with a joint density fXYZ then we sayX is conditionally independent of Y given Z, and write

X ?? Y|Z

iffXY|Z(x,y|z) = fX|Z(x|z)fY|Z(y|z).

EquivalentlyX ?? Y|Z () fX|YZ(x|y, z) = fX|Z(x|z) :

the conditional distribution of X given Y and Z depends only on Z.

4.1 Conditional independence graphs

Graphs provide a convenient way to visualise conditional independencies between randomvariables. A graph G is a pair (V,E) where V is a set and E is a subset of

{{x, y} : x, y 2 V, x 6= y},

the set of unordered distinct pairs from V . We call the members of V vertices and membersof E edges. In the graphs that we consider, we will always take V to be {1, . . . , |V |}.

Given j, k 2 V a path of length m from j to k is a sequence j = j1, j2, . . . , jm = k ofdistinct vertices such that {ji, ji+1} 2 E, i = 1, . . . ,m� 1.

Given a triple of subsets of vertices A,B, S, we say S separates A from B if every pathfrom a vertex in A to a vertex in B contains a vertex in S.

Let Z = (Z1, . . . , Zp)T be a collection of random variables with joint law P and considera graph G = (V,E) where V = {1, . . . , p}. We say that P satisfies the pairwise Markov

property w.r.t. G if for any pair j, k 2 V with j 6= k and {j, k} /2 E,

Zj ?? Zk|Z�jk.

Note that the complete graph that has edges between every pair of vertices will satisfy thepairwise Markov property for any P . The minimal graph satisfying the pairwise Markovproperty w.r.t. a given P is called the conditional independence graph (CIG) for P . Thisdoes not have an edge between j and k if and only if Zj ?? Zk|Z�jk.

We say P satisfies the global Markov property w.r.t. G if for any triple (A,B, S) ofdisjoint subsets of V such that S separates A from B, we have

ZA ?? ZB|ZS.

Proposition 13 (See Lauritzen [1996] for example). If P has a positive density then if it

satisfies the pairwise Markov property w.r.t. a graph G, it also satisfies the global Markov

property w.r.t. G and vice versa.

23

Page 24: APTS High-dimensional Statistics - Warwick

4.2 Gaussian graphical models

Estimating the CIG given a sample drawn from P is a di�cult task in general (indeed seeShah and Peters [2018] for an impossibility result on conditional independence testing).However, in the case where P is multivariate Gaussian, things simplify considerably as weshall see.

The following result on the conditional distributions of a multivariate normal (see thepreliminary material for a proof) will be central to our discussion. Let Z ⇠ Np(µ,⌃) with⌃ positive definite. Note ⌃A,A is also positive definite for any A.

Proposition 14.

ZA|ZB = zB ⇠ N|A|(µA+⌃A,B⌃

�1

B,B(zB � µ

B), ⌃A,A �⌃A,B⌃

�1

B,B⌃B,A)

4.2.1 Neighbourhood selection

Specialising to the case where A = {k} and B = {k}c we see that when conditioning onZ�k = z�k, we may write

Zk = mk + zT

�k⌃

�1

�k,�k⌃�k,k + "k,

where

mk = µk �⌃k,�k⌃�1

�k,�kµ�k

"k|Z�k = z�k ⇠ N (0, ⌃k,k �⌃k,�k⌃�1

�k,�k⌃�k,k).

Note that if the jth element of the vector of coe�cients ⌃�1

�k,�k⌃�k,k is zero, then the

distribution of Zk conditional on Z�k will not depend at all on the jth component of Z�k.Then if that jth component was Zj0 , we would have that Zk|Z�k = z�k has the samedistribution as Zk|Z�j0k = z�j0k, so Zk ?? Zj|Z�j0k.

Thus given x1, . . . ,xn

i.i.d.⇠ Z and writing

X =

0

B@xT

1

...xT

n

1

CA ,

we may estimate the coe�cient vector ⌃�1

�k,�k⌃�k,k by regressing Xk on X�k and including

an intercept term.The technique of neighbourhood selection [Meinshausen and Buhlmann, 2006] involves

performing such a regression for each variable, using the Lasso. There are two options forpopulating our estimate of the CIG with edges based on the Lasso estimates. Writing Sk

for the selected set of variables when regressing Xk on X�k, we can use the “OR” rule andput an edge between vertices j and k if and only if k 2 Sj or j 2 Sk. An alternative is the“AND” rule where we put an edge between j and k if and only if k 2 Sj and j 2 Sk.

Another popular approach to estimating the CIG works by estimating the precisionmatrix ⌃

�1.

24

Page 25: APTS High-dimensional Statistics - Warwick

4.2.2 Schur complements and the precision matrix

The following facts about blockwise inversion of matrices will help us to interpret the meanand variance in Proposition 14.

Proposition 15. Let M 2 Rp⇥pbe a symmetric positive definite matrix and suppose

M =

✓P Q

T

Q R

with P and R square matrices. The Schur complement of R is P �QTR

�1Q =: S. We

have that S is positive definite and

M�1 =

✓S�1 �S

�1Q

TR

�1

�R�1QS

�1R

�1 +R�1QS

�1Q

TR

�1

◆.

Set M to be ⌃ and let ⌦ = ⌃�1 be the precision matrix. Taking P = ⌃11 we see that

⌦�1,1 = �⌃�1

�1,�1⌃�1,1⌦11.

By symmetry we then also have ⌃�1

�k,�k⌃�k,k = �⌦�1

kk⌦�k,k so

(⌃�1

�k,�k⌃�k,k)j = 0 ,

(⌦j,k = 0 for j < k

⌦j+1,k = 0 for j � k.

ThusZk ?? Zj|Z�jk , ⌦jk = 0.

This motivates another approach to estimating the CIG.

4.2.3 The Graphical Lasso

Recall that the density of Np(µ,⌃) is

f(z) =1

(2⇡)p/2det(⌃)1/2exp

✓� 1

2(z� µ)T⌃�1(z� µ)

◆.

The log-likelihood of (µ,⌃) based on an i.i.d. sample x1, . . . ,xn is

`(µ,⌦) =n

2log det(⌦)� 1

2

nX

i=1

(xi � µ)T⌦(xi � µ).

Write

X =1

n

nX

i=1

xi, S =1

n

nX

i=1

(xi � X)(xi � X)T .

25

Page 26: APTS High-dimensional Statistics - Warwick

Then

nX

i=1

(xi � µ)T⌦(xi � µ) =nX

i=1

(xi � X+ X� µ)T⌦(xi � X+ X� µ)

=nX

i=1

(xi � X)T⌦(xi � X) + n(X� µ)T⌦(X� µ)

+ 2nX

i=1

(xi � X)T⌦(X� µ).

Also,

nX

i=1

(xi � X)T⌦(xi � X) =nX

i=1

tr{(xi � X)T⌦(xi � X)}

=nX

i=1

tr{(xi � X)(xi � X)T⌦}

= ntr(S⌦).

Thus`(µ,⌦) = �n

2{tr(S⌦)� log det(⌦) + (X� µ)T⌦(X� µ)}

andmaxµ2Rp

`(µ,⌦) = �n

2{tr(S⌦)� log det(⌦)}.

Hence the maximum likelihood estimate of ⌦ can be obtained by solving

min⌦:⌦�0

{� log det(⌦) + tr(S⌦)},

where ⌦ � 0 means ⌦ is positive definite. One can show that the objective is convex andwe are minimising over a convex set.

The graphical Lasso [Yuan and Lin, 2007, Friedman et al., 2008] penalises the log-likelihood for ⌦ and solves

min⌦:⌦�0

{� log det(⌦) + tr(S⌦) + �k⌦k1},

where k⌦k1 =P

j,k|⌦jk|; this results in a sparse estimate of the precision matrix from

which an estimate of the CIG can be constructed.

5 High-dimensional inference

In many modern applications, we may be interested in testing many hypotheses simulta-neously. Suppose we are interested in testing null hypotheses H1, . . . , Hm and Hi, i 2 I0

26

Page 27: APTS High-dimensional Statistics - Warwick

are the true null hypotheses with |I0| = m0 (we do not mention the alternative hypothesesexplicitly). We will suppose we have available p-values p1, . . . , pm for each of the hypothesesso

P(pi ↵) ↵

for all ↵ 2 [0, 1], i 2 I0. Given a multiple testing procedure takes as input the vector ofp-values, and outputs a subset R ✓ {1, . . . ,m} of rejected hypotheses. Let N = |R \ I0|be the number of falsely rejected hypotheses, and let R = |R| be the number of rejections.

5.1 Family-wise error rate control

Traditional approaches to multiple testing have sought to control the familywise error rate(FWER) defined by

FWER = P(N � 1)

at a prescribed level ↵; i.e. find procedures for which FWER ↵. The simplest suchprocedure is the Bonferroni correction, which rejects Hi if pi ↵/m.

Theorem 16. Using Bonferroni correction,

P(N � 1) E(N) m0↵

m ↵.

Proof. The first inequality comes from Markov’s inequality. Next

E(N) = E✓X

i2I0{pi↵/m}

=X

i2I0

P(pi ↵/m)

m0↵

m.

A more sophisticated approach is the closed testing procedure. Given our family ofhypotheses {Hi}mi=1

, define the closure of this family to be

{HI : I ✓ {1, . . . ,m}, I 6= ;}

where HI = \i2IHi is known as an intersection hypothesis (HI is the hypothesis that allHi i 2 I are true).

Suppose that for each I, we have an ↵-level test �I taking values in {0, 1} for testingHI (we reject if �I = 1), so under HI ,

PHI (�I = 1) ↵.

The �I are known as local tests.The closed testing procedure [Marcus et al., 1976] is defined as follows:

27

Page 28: APTS High-dimensional Statistics - Warwick

Reject HI if and only if for all J ◆ I,

HJ is rejected by the local test �J .

Typically we only make use of the individual hypotheses that are rejected by the procedurei.e. those rejected HI where I is a singleton.

We consider the case of 4 hypotheses as an example. Suppose the underlined hypothesesare rejected by the local tests.

H1234

H123 H124 H134 H234

H12 H13 H14 H23 H24 H34

H1 H2 H3 H4

• Here H1 is rejected be the closed testing procedure.

• H2 is not rejected by the closed testing procedure as H24 is not rejected by the localtest.

• H23 is rejected by the closed testing procedure.

Theorem 17. The closed testing procedure makes no false rejections with probability 1�↵.In particular it controls the FWER at level ↵.

Proof. Assume I0 is not empty (as otherwise no rejection can be false anyway). In orderfor the procedure to make a false rejection, we must have �I0 = 1, but this will occur withprobability at most ↵.

Di↵erent choices for the local tests give rise to di↵erent testing procedures. Holm’s

procedure [Holm, 1979] takes �I to be the Bonferroni test i.e.

�I =

(1 if mini2I pi ↵

|I|

0 otherwise.

To understand what Holm’s procedure does, let us order the p-values p1, . . . , pm as p(1) · · · p(m) with corresponding hypothesis tests H(1), . . . , H(m), so (i) is the index of the ithsmallest p-value. First consider under what circumstances H(1) is rejected. All subsets Icontaining (1) must have mini2I pi = p(1) ↵/|I|. The minimum of the RHS occurs whenI is the full set {1, . . . ,m} so we conclude H(1) is rejected if and only if p(1) ↵/m.

Similarly H(2) will be rejected when all I containing (2) have mini2I pi ↵/|I|. Thuswe must certainly have p(1) ↵/m, in which case we only need to ensure all I containing(2) but not (1) have mini2I pi = p(2) ↵/|I|. The minimum of the RHS occurs whenI = {1, . . . ,m} \ {1}, so we must have p(2) ↵/(m � 1). Continuing this argument, wesee that Holm’s procedure amounts to the following:

28

Page 29: APTS High-dimensional Statistics - Warwick

Step 1. If p(1) ↵/m reject H(1), and go to step 2. Otherwise accept H(1), . . . , H(m) andstop.

Step i. If p(i) ↵/(m�i+1), rejectH(i) and go to step i+1. Otherwise acceptH(i), . . . , H(m).

Step m. If p(m) ↵, reject H(m). Otherwise accept H(m).

The p-values are visited in ascending order and rejected until the first time a p-value exceedsa given critical value. This sort of approach is known (slightly confusingly) as a step-down

procedure.

5.2 The False Discovery Rate

A di↵erent approach to multiple testing does not try to control the FWER, but insteadattempts to control the false discovery rate (FDR) defined by

FDR = E(FDP)

FDP =N

max(R, 1),

where FDP is the false discovery proportion. Note the maximum in the denominator isto ensure division by zero does not occur. The FDR was introduced in Benjamini andHochberg [1995], and it is now widely used across science, particularly biostatistics.

The Benjamini–Hochberg procedure attempts to control the FDR at level $\alpha$ and works as follows. Let
$$k = \max\Big\{ i : p_{(i)} \leq \frac{i\alpha}{m} \Big\}.$$
Reject $H_{(1)}, \dots, H_{(k)}$ (or perform no rejections if $k$ is not defined).
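A sketch of the Benjamini–Hochberg step-up rule, again with illustrative names only:

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # Reject H_(1), ..., H_(k) where k = max{ i : p_(i) <= i * alpha / m }
    # (no rejections if that set is empty).
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1   # largest i with p_(i) <= i*alpha/m
        reject[order[:k]] = True
    return reject

print(benjamini_hochberg([0.001, 0.013, 0.04, 0.6], alpha=0.05))   # rejects the two smallest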

Theorem 18. Suppose that the $p_i$, $i \in I_0$, are independent, and independent of $\{p_i : i \notin I_0\}$. Then the Benjamini–Hochberg procedure controls the FDR at level $\alpha$; in fact
$$\mathrm{FDR} \leq \frac{\alpha m_0}{m}.$$

Proof. For each $i \in I_0$, let $R_i$ denote the number of rejections we get by applying a modified Benjamini–Hochberg procedure to
$$p^{\backslash i} := \{p_1, p_2, \dots, p_{i-1}, p_{i+1}, \dots, p_m\}$$
with cutoff
$$k_i = \max\Big\{ j : p^{\backslash i}_{(j)} \leq \frac{\alpha(j+1)}{m} \Big\},$$
where $p^{\backslash i}_{(j)}$ is the $j$th smallest p-value in the set $p^{\backslash i}$.


For $r = 1, \dots, m$ and $i \in I_0$, note that
$$\Big\{ p_i \leq \frac{\alpha r}{m},\; R = r \Big\} = \Big\{ p_i \leq \frac{\alpha r}{m},\; p_{(r)} \leq \frac{\alpha r}{m},\; p_{(s)} > \frac{\alpha s}{m} \text{ for all } s > r \Big\}$$
$$= \Big\{ p_i \leq \frac{\alpha r}{m},\; p^{\backslash i}_{(r-1)} \leq \frac{\alpha r}{m},\; p^{\backslash i}_{(s-1)} > \frac{\alpha s}{m} \text{ for all } s > r \Big\}$$
$$= \Big\{ p_i \leq \frac{\alpha r}{m},\; R_i = r - 1 \Big\}.$$

Thus
$$\begin{aligned}
\mathrm{FDR} &= \mathbb{E}\Big( \frac{N}{\max(R, 1)} \Big) = \sum_{r=1}^m \mathbb{E}\Big( \frac{N}{r} \mathbb{1}_{\{R = r\}} \Big) \\
&= \sum_{r=1}^m \frac{1}{r} \mathbb{E}\Big( \sum_{i \in I_0} \mathbb{1}_{\{p_i \leq \alpha r / m\}} \mathbb{1}_{\{R = r\}} \Big) \\
&= \sum_{r=1}^m \frac{1}{r} \sum_{i \in I_0} \mathbb{P}(p_i \leq \alpha r / m,\; R = r) \\
&= \sum_{r=1}^m \frac{1}{r} \sum_{i \in I_0} \mathbb{P}(p_i \leq \alpha r / m)\, \mathbb{P}(R_i = r - 1) \\
&\leq \frac{\alpha}{m} \sum_{i \in I_0} \sum_{r=1}^m \mathbb{P}(R_i = r - 1) = \frac{\alpha m_0}{m}.
\end{aligned}$$
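As a sanity check on Theorem 18, one can simulate independent Uniform(0,1) null p-values alongside some stochastically smaller "signal" p-values and estimate the FDR empirically. The parameter choices below are arbitrary, and benjamini_hochberg is the sketch given earlier.

import numpy as np

rng = np.random.default_rng(0)
m, m0, alpha, n_sim = 50, 40, 0.1, 2000   # 40 true nulls, 10 signals; all choices arbitrary
fdp = np.zeros(n_sim)

for b in range(n_sim):
    # nulls are independent Uniform(0,1); signal p-values are stochastically smaller
    p = np.concatenate([rng.uniform(size=m0), rng.beta(0.1, 1.0, size=m - m0)])
    reject = benjamini_hochberg(p, alpha=alpha)   # the sketch from above
    n_false = reject[:m0].sum()                   # rejections among the true nulls
    fdp[b] = n_false / max(reject.sum(), 1)

print(fdp.mean(), alpha * m0 / m)   # estimated FDR vs. the bound alpha*m0/m from Theorem 18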

5.3 Hypothesis testing in high-dimensional regression

Consider the normal linear model $Y = X\beta^0 + \varepsilon$ where $\varepsilon \sim N_n(0, \sigma^2 I)$. In the low-dimensional setting, the fact that $\hat{\beta}^{\mathrm{OLS}} - \beta^0 \sim N_p(0, \sigma^2 (X^T X)^{-1})$ allows us to perform hypothesis tests with $H_0 : \beta^0_j = 0$, for example.

One might hope that studying the distribution of $\hat{\beta}^L_\lambda - \beta^0$ would enable us to do this in the high-dimensional setting when $p \gg n$. However, the distribution of $\hat{\beta}^L_\lambda - \beta^0$ is intractable and depends delicately on the unknown $\beta^0$, making it unsuitable as a basis for performing formal hypothesis tests. Other approaches must therefore be used, and below we review some of these (see also Dezeure et al. [2015] for a broader review).

Stability selection. Stability selection [Meinshausen and Buhlmann, 2010, Shah and Samworth, 2013] is a general technique for uncertainty quantification when performing variable selection. Given a base variable selection procedure (e.g. $\hat{S}_\lambda$ from the Lasso, or the first $q$ variables selected by forward selection), it advocates applying this to subsamples of the data of size $\lfloor n/2 \rfloor$. We can then compute the proportion of times $\hat{\pi}_j$ that the $j$th variable was selected across the different subsamples, for each $j$. We can take as our final selected set $\hat{S}_\tau = \{j : \hat{\pi}_j \geq \tau\}$ for some threshold $\tau$. One advantage over simply taking the set of variables selected by the base selection procedure is that bounds are available on $\mathbb{E}(|\hat{S}_\tau \cap S^c|)$, the expected number of noise variables selected. However formal hypothesis tests are unavailable.
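A bare-bones sketch of the subsampling scheme, assuming a fixed Lasso penalty as the base selector; the complementary-pairs refinement and the error bounds of Shah and Samworth [2013] are not implemented, and all names and parameter values are illustrative.

import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, lam, tau=0.9, n_subsamples=100, seed=0):
    # Selection proportions pi_j over subsamples of size floor(n/2),
    # using the Lasso with a fixed penalty lam as the base selector.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        beta = Lasso(alpha=lam).fit(X[idx], y[idx]).coef_
        counts += beta != 0
    pi = counts / n_subsamples
    return pi, np.where(pi >= tau)[0]     # selection proportions and the thresholded set

# toy example with 3 signal variables among 200
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 200))
y = X[:, :3] @ np.array([2.0, -2.0, 2.0]) + rng.standard_normal(100)
pi, selected = stability_selection(X, y, lam=0.2)
print(selected)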

Sample splitting. The approach of Meinshausen et al. [2009] exploits the variable screening property of the Lasso (Corollary 8): provided the true non-zero coefficients are sufficiently far from 0, the set of non-zeroes of the Lasso will contain the true set $S$. Specifically, first the data are split into two halves: $(Y^{(1)}, X^{(1)})$ and $(Y^{(2)}, X^{(2)})$. The Lasso is applied on $(Y^{(1)}, X^{(1)})$, giving a selected set $\hat{S}$. Next OLS is applied on $(Y^{(2)}, X^{(2)}_{\hat{S}})$ and p-values are obtained for those variables in $\hat{S}$. The p-values for the variables in $\hat{S}^c$ are set to 1. This process is applied using different random splits of the data and the resulting collection of p-values is aggregated to give a single p-value for each variable. Though this works reasonably well in practice, a drawback is that the validity of the p-values requires a condition on the unknown $\beta^0$. Given that the task is to perform inference for $\beta^0$, such conditions are undesirable.
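The following sketch carries out a single split of the data, using scikit-learn's LassoCV as the screening step (my choice rather than anything prescribed by Meinshausen et al. [2009]); the aggregation of p-values over many random splits, which is the essential ingredient of their method, is omitted.

import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV

def single_split_pvalues(X, y, seed=0):
    # One split: screen with the Lasso on half the data, then run
    # low-dimensional OLS t-tests on the other half for the selected set.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    perm = rng.permutation(n)
    i1, i2 = perm[: n // 2], perm[n // 2 :]

    S = np.nonzero(LassoCV(cv=5).fit(X[i1], y[i1]).coef_)[0]
    pvals = np.ones(p)                       # p-values for unselected variables set to 1
    if len(S) == 0 or len(S) >= len(i2) - 1:
        return pvals                         # nothing to test, or OLS not feasible

    X2 = np.column_stack([np.ones(len(i2)), X[np.ix_(i2, S)]])
    y2 = y[i2]
    beta = np.linalg.lstsq(X2, y2, rcond=None)[0]
    dof = len(i2) - X2.shape[1]
    sigma2 = np.sum((y2 - X2 @ beta) ** 2) / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X2.T @ X2)))
    tstat = beta / se
    pvals[S] = 2 * stats.t.sf(np.abs(tstat[1:]), dof)   # skip the intercept
    return pvals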

Recent advances. In the last few years, new methods have been introduced based on debiasing the Lasso that overcome the issue above [Zhang and Zhang, 2014, Van de Geer et al., 2014]. Extending and improving on these methods is currently a highly active area of research. We will try to give a flavour of some of these developments by outlining a simple method for performing hypothesis testing based on some material from Shah and Buhlmann [2018]. Consider testing $H_j : \beta^0_j = 0$, so under the null hypothesis
$$Y = X_{-j}\beta^0_{-j} + \varepsilon$$
(we will ignore the intercept term for simplicity). Further assume $\varepsilon \sim N_n(0, \sigma^2 I)$. Idea: if the null model is true, the residuals from regressing $Y$ on $X_{-j}$ using the Lasso should look roughly like $\varepsilon$, and so their correlation with $X_j$ should be small in absolute value. On the other hand, if $\beta^0_j \neq 0$, the residuals should contain some trace of $\beta^0_j X_j$ and so perhaps their correlation with $X_j$ will be larger in absolute value.

Let $\hat{\beta}$ be the Lasso estimate from regression of $Y$ on $X_{-j}$ with tuning parameter $A\sigma\sqrt{\log(p)/n}$. Now under the null, the residuals are $X_{-j}(\beta^0_{-j} - \hat{\beta}) + \varepsilon$. If we consider the dot product with $X_j$, we get
$$\underbrace{\frac{1}{\sqrt{n}} X_j^T X_{-j}(\beta^0_{-j} - \hat{\beta})}_{\text{negligible?}} \;+\; \underbrace{\frac{1}{\sqrt{n}} X_j^T \varepsilon}_{\sim N(0, \sigma^2)}. \qquad (5.1)$$


Unfortunately however, though we know from Theorem 7 that
$$\|\beta^0_{-j} - \hat{\beta}\|_1 \leq 12 A \sigma \phi^{-2} s \sqrt{\log(p)/n}$$
with high probability, $\|X_{-j}^T X_j\|_\infty / \sqrt{n}$ is a maximum of sums of $n$ terms and would typically be growing like $n/\sqrt{n} = \sqrt{n}$. We cannot therefore apply Hölder's inequality to conclude that the first term is negligible as desired. Instead consider defining $W_j$ to be the residuals from regressing $X_j$ on to $X_{-j}$ using the Lasso with tuning parameter $B\sqrt{\log(p)/n}$ for some $B > 0$. The KKT conditions give us that
$$\frac{1}{n}\|X_{-j}^T W_j\|_\infty \leq B\sqrt{\log(p)/n}.$$
By replacing $X_j$ with its decorrelated version $W_j$ and choosing the tuning parameters appropriately, we can ensure that the first term in (5.1) is negligible. Indeed by Hölder's inequality,
$$|W_j^T X_{-j}(\beta^0_{-j} - \hat{\beta})| / \|W_j\|_2 \leq 12 A B \sigma \phi^{-2} s \log(p) / \|W_j\|_2$$
with high probability.

Thus under the null hypothesis, our test statistic $W_j^T(Y - X_{-j}\hat{\beta}) / \|W_j\|_2$ is approximately $N(0, \sigma^2)$ provided $\sigma s \log(p) / \|W_j\|_2$ is small. Making use of the square-root Lasso gives a similar test statistic whose approximate null distribution is in fact standard normal, thereby allowing for the computation of p-values. One can also show that the test has optimal power in a certain sense [Van de Geer et al., 2014, Shah and Buhlmann, 2018].
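A rough numerical illustration of the idea: both tuning parameters and the naive noise estimate below are ad hoc choices of mine, whereas the recipe above suggests the square-root Lasso to obtain a pivotal statistic, so this is a sketch of the principle rather than the actual procedure.

import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

def decorrelated_test(X, y, j, lam_y=None, lam_x=None):
    # Regress y and X_j on X_{-j} with the Lasso, then compare
    # W_j^T (y - X_{-j} beta_hat) / ||W_j||_2 with a normal reference.
    n, p = X.shape
    X_minus_j = np.delete(X, j, axis=1)
    lam = np.sqrt(np.log(p) / n)             # ad hoc default tuning parameter
    lam_y = lam if lam_y is None else lam_y
    lam_x = lam if lam_x is None else lam_x

    resid_y = y - Lasso(alpha=lam_y).fit(X_minus_j, y).predict(X_minus_j)
    W_j = X[:, j] - Lasso(alpha=lam_x).fit(X_minus_j, X[:, j]).predict(X_minus_j)

    sigma_hat = np.std(resid_y)              # crude noise estimate (not the square-root Lasso)
    T = W_j @ resid_y / (np.linalg.norm(W_j) * sigma_hat)
    return T, 2 * stats.norm.sf(abs(T))      # statistic and two-sided p-value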

References

A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300, 1995.

L. Breiman. Stacked regressions. Machine Learning, 24:49–64, 1996.

R. Dezeure, P. Buhlmann, L. Meier, N. Meinshausen, et al. High-dimensional inference: Confidence intervals, p-values and R-software hdi. Statistical Science, 30(4):533–558, 2015.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432, 2008.

J. Friedman, T. Hastie, and R. Tibshirani. glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1, 2009.


A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, pages 55–67, 1970.

T. Hofmann, B. Scholkopf, and A. Smola. Kernel methods in machine learning. Annals of Statistics, pages 1171–1220, 2008.

S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6:65–70, 1979.

G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495–502, 1970.

S. Lauritzen. Graphical Models. Oxford University Press, 1996.

R. Marcus, P. Eric, and K. R. Gabriel. On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63(3):655–660, 1976.

N. Meinshausen. Relaxed lasso. Computational Statistics and Data Analysis, 52:374–393, 2007.

N. Meinshausen and P. Buhlmann. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436–1462, 2006.

N. Meinshausen and P. Buhlmann. Stability selection. Journal of the Royal Statistical Society: Series B, 72:417–473, 2010.

N. Meinshausen, L. Meier, and P. Buhlmann. P-values for high-dimensional regression. Journal of the American Statistical Association, 104:1671–1681, 2009.

G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

P. Ravikumar, H. Liu, J. D. Lafferty, and L. A. Wasserman. SpAM: Sparse additive models. In NIPS, pages 1201–1208, 2007.

B. Scholkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2001.

B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416–426. Springer, 2001.

R. D. Shah and P. Buhlmann. Goodness-of-fit tests for high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):113–135, 2018.

R. D. Shah and J. Peters. The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203, 2018.


R. D. Shah and R. J. Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):55–80, 2013.

T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.

P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

S. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36:614–645, 2008.

S. Van de Geer, P. Buhlmann, Y. Ritov, R. Dezeure, et al. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.

M. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. IEEE Transactions on Information Theory, 55:2183–2202, 2009.

D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68:49–67, 2006.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.

C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, pages 894–942, 2010.

C.-H. Zhang and S. S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.

P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
