
Statistica Sinica 20 (2010), 101-148

Invited Review Article

A SELECTIVE OVERVIEW OF VARIABLE SELECTION IN HIGH DIMENSIONAL FEATURE SPACE

Jianqing Fan and Jinchi Lv

Princeton University and University of Southern California

Abstract: High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are, rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods.

Key words and phrases: Dimensionality reduction, folded-concave penalty, high dimensionality, LASSO, model selection, oracle property, penalized least squares, penalized likelihood, SCAD, sure independence screening, sure screening, variable selection.

1. Introduction

High dimensional data analysis has become increasingly frequent and important in diverse fields of sciences, engineering, and humanities, ranging from genomics and health sciences to economics, finance, and machine learning. It characterizes many contemporary problems in statistics (Hastie, Tibshirani and Friedman (2009)). For example, in disease classification using microarray or proteomics data, tens of thousands of expressions of molecules or ions are potential predictors; in genome-wide association studies between genotypes and phenotypes, hundreds of thousands of SNPs are potential covariates for phenotypes such as cholesterol levels or heights. When interactions are considered, the dimensionality grows quickly. For example, portfolio allocation among two thousand stocks already involves over two million parameters in the covariance matrix; interactions of molecules in the above examples result in ultra-high dimensionality. To be more precise, throughout the paper ultra-high dimensionality refers to the case where the dimensionality grows at a non-polynomial rate as the sample size increases, and high dimensionality refers to the general case of growing dimensionality. Other examples of high dimensional data include high-resolution images, high-frequency financial data, e-commerce data, warehouse data, and functional and longitudinal data, among others. Donoho (2000) convincingly demonstrates the need for developments in high dimensional data analysis, and presents the curses and blessings of dimensionality. Fan and Li (2006) give a comprehensive overview of statistical challenges with high dimensionality in a broad range of topics and, in particular, demonstrate that for a host of statistical problems, the model parameters can be estimated as well as if the best model were known in advance, as long as the dimensionality is not excessively high. The challenges that are not present in smaller scale studies have been reshaping statistical thinking, methodological development, and theoretical studies.

Statistical accuracy, model interpretability, and computational complexity are three important pillars of any statistical procedure. In conventional studies, the number of observations n is much larger than the number of variables or parameters p. In such cases, none of the three aspects needs to be sacrificed for the efficiency of the others. The traditional methods, however, face significant challenges when the dimensionality p is comparable to, or larger than, the sample size n. These challenges include how to design statistical procedures that are more efficient in inference; how to derive the asymptotic or nonasymptotic theory; how to make the estimated models interpretable; and how to make the statistical procedures computationally efficient and robust.

A notorious difficulty of high dimensional model selection comes from the collinearity among the predictors. Collinearity can easily be spurious in high dimensional geometry (Fan and Lv (2008)), which can lead us to select a wrong model. Figure 1 shows the maximum sample correlation and multiple correlation with a given predictor even when the predictors are generated as independent Gaussian random variables. As a result, any variable can be well approximated by a couple of spurious variables, and can even be replaced by them when the dimensionality is much higher than the sample size. If that variable is a signature predictor and is replaced by spurious variables, we choose the wrong variables to associate the covariates with the response and, even worse, the spurious variables can be independent of the response at the population level, leading to completely wrong scientific conclusions. Indeed, when the dimensionality p is large, intuition might not be accurate. This is also exemplified by the data piling problems in high dimensional space observed in Hall, Marron and Neeman (2005). Collinearity also gives rise to issues of over-fitting and model mis-identification.

Figure 1. Distributions (left panel) of the maximum absolute sample correlation coefficient $\max_{2 \le j \le p} |\mathrm{corr}(Z_1, Z_j)|$, and distributions (right panel) of the maximum absolute multiple correlation coefficient of $Z_1$ with 5 other variables, $\max_{|S|=5} |\mathrm{corr}(Z_1, Z_S^T \hat\beta_S)|$, where $\hat\beta_S$ is the regression coefficient of $Z_1$ regressed on $Z_S$, a subset of variables indexed by $S$ and excluding $Z_1$, computed by the stepwise addition algorithm (the actual values are larger than those presented here), when n = 50, p = 1,000 (solid curve) and p = 10,000 (dashed curve), based on 1,000 simulations.

Noise accumulation in high dimensional prediction has long been recognized in statistics and computer science. Explicit characterization of this is well known for high dimensional regression problems. The quantification of the impact of dimensionality on classification was not well understood until Fan and Fan (2008), who give a simple expression for how dimensionality impacts misclassification rates. Hall, Pittelkow and Ghosh (2008) study a similar problem for distance-based classifiers and observe implicitly the adverse impact of dimensionality. As shown in Fan and Fan (2008), even for the independence classification rule described in Section 4.2, classification using all features can be as bad as a random guess due to noise accumulation in estimating the population centroids in high dimensional feature space. Therefore, variable selection is fundamentally important to high dimensional statistical modeling, including regression and classification.

What makes high dimensional statistical inference possible is the assumption that the regression function lies in a low dimensional manifold. In such cases, the p-dimensional regression parameters are assumed to be sparse, with many components being zero, where the nonzero components indicate the important variables. With sparsity, variable selection can improve the estimation accuracy by effectively identifying the subset of important predictors, and can enhance the model interpretability with parsimonious representation. It can also help reduce the computational cost when sparsity is very high.

This notion of sparsity is in a narrow sense. It should be understood more widely in transformed or enlarged feature spaces. For instance, some prior knowledge may lead us to apply a grouping or transformation of the input variables (see, e.g., Fan and Lv (2008)). Some transformation of the variables may be appropriate if a significant portion of the pairwise correlations are high. In some cases, we may want to enlarge the feature space by adding interactions and higher order terms to reduce the bias of the model. Sparsity can also be viewed in the context of dimensionality reduction by introducing a sparse representation, i.e., by reducing the number of effective parameters in estimation. Examples include the use of a factor model for high dimensional covariance matrix estimation in Fan, Fan and Lv (2008).

Sparsity arises in many scientific endeavors. In genomic studies, it is generally believed that only a fraction of the molecules are related to biological outcomes. For example, in disease classification, it is commonly believed that only tens of genes are responsible for a disease. Selecting tens of genes helps not only statisticians in constructing a more reliable classification rule, but also biologists in understanding the molecular mechanisms. In contrast, popular but naive methods used in microarray data analysis (Dudoit, Shaffer and Boldrick (2003), Storey and Tibshirani (2003), Fan and Ren (2006), and Efron (2007)) rely on two-sample tests to pick important genes, which is truly a marginal correlation ranking (Fan and Lv (2008)) and can miss important signature genes (Fan, Samworth and Wu (2009)). The main goals of high dimensional regression and classification, according to Bickel (2008), are

(1) to construct as effective a method as possible to predict future observations;

(2) to gain insight into the relationship between features and response for scientific purposes, as well as, hopefully, to construct an improved prediction method.

The former appears in problems such as text and document classification or portfolio optimization, whereas the latter appears naturally in many genomic studies and other scientific endeavors.

As pointed out in Fan and Li (2006), it is helpful to differentiate two types of statistical endeavors in high dimensional statistical learning: accuracy of the estimated model parameters and accuracy of the expected loss of the estimated model. The latter property is called persistence in Greenshtein and Ritov (2004) and Greenshtein (2006), and arises frequently in machine learning problems such as document classification and computer vision. The former appears in many other contexts where we want to identify the significant predictors and characterize the precise contribution of each to the response variable. Examples include health studies, where the relative importance of identified risk factors needs to be assessed for prognosis. Many of the existing results in the literature have been concerned with the consistency of high dimensional variable selection methods, rather than with characterizing the asymptotic distributions of the estimated model parameters. However, consistency and persistence results are inadequate for understanding uncertainty in parameter estimation.

High dimensional variable selection encompasses a majority of the frontiers where statistics advances rapidly today. There has been an evolving literature in the last decade devoted to understanding the performance of various variable selection techniques. The main theoretical questions include determining the limits of the dimensionality that such methods can handle and how to characterize the optimality of variable selection procedures. The answers to the first question for many existing methods were largely unknown until recently. To a large extent, the second question still remains open for many procedures. In the Gaussian linear regression model, the case of orthonormal design reduces to the problem of Gaussian mean estimation, as do the wavelet settings where the design matrices are orthogonal. In such cases, the risks of various shrinkage estimators and their optimality have been extensively studied. See, e.g., Donoho and Johnstone (1994) and Antoniadis and Fan (2001).

In this article we address the issues of variable selection for high dimensional statistical modeling within the unified framework of penalized likelihood estimation. Penalized likelihood has been widely used in statistical inference and machine learning, and is basically a moderate scale learning technique. We also give an overview of techniques for ultra-high dimensional screening. Combined iteratively with large scale screening, penalized likelihood can handle problems of ultra-high dimensionality (Fan, Samworth and Wu (2009)). This will be reviewed as well.

The rest of the article is organized as follows. In Section 2, we discuss the connections of penalized likelihood to classical model selection methods. Section 3 details the methods and implementation of penalized likelihood estimation. We review some recent advances in ultra-high dimensional variable selection in Section 4. In Section 5, we survey the sampling properties of penalized least squares. Section 6 presents the classical oracle property of penalized least squares and penalized likelihood methods in ultra-high dimensional space. We conclude the article with some additional remarks in Section 7.

2. Classical Model Selection

Suppose that the available data are $(x_i^T, y_i)_{i=1}^n$, where $y_i$ is the $i$-th observation of the response variable and $x_i$ is its associated $p$-dimensional covariate vector. They are usually assumed to be a random sample from the population $(X^T, Y)$, where the conditional mean of $Y$ given $X$ depends on the linear predictor $\beta^T X$ with $\beta = (\beta_1, \ldots, \beta_p)^T$. In sparse modeling, it is frequently assumed that most regression coefficients $\beta_j$ are zero. Variable selection aims to identify all important variables whose regression coefficients do not vanish and to provide effective estimates of those coefficients.

More generally, assume that the data are generated from the true density function $f_{\theta_0}$ with parameter vector $\theta_0 = (\theta_1, \ldots, \theta_d)^T$. Often, we are uncertain about the true density, but more certain about a larger family of models $f_{\theta_1}$ in which $\theta_0$ is a (nonvanishing) subvector of the $p$-dimensional parameter vector $\theta_1$. The problems of how to estimate the dimension of the model and how to compare models of different dimensions naturally arise in many statistical applications, including time series modeling. These are referred to as model selection in the literature.

Akaike (1973, 1974) proposes to choose a model that minimizes the Kullback-Leibler (KL) divergence of the fitted model from the true model. Akaike (1973) considers the maximum likelihood estimator (MLE) $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_p)^T$ of the parameter vector and shows that, up to an additive constant, the estimated KL divergence can be asymptotically expanded as
$$-\ell_n(\hat\theta) + \lambda \dim(\hat\theta) = -\ell_n(\hat\theta) + \lambda \sum_{j=1}^{p} I(\hat\theta_j \neq 0),$$
where $\ell_n(\cdot)$ is the log-likelihood function, $\dim(\cdot)$ denotes the dimension of the model, and $\lambda = 1$. This leads to the AIC. Schwartz (1978) takes a Bayesian approach with prior distributions that have nonzero prior probabilities on some lower dimensional subspaces and proposes the BIC with $\lambda = (\log n)/2$ for model selection. Recently, Lv and Liu (2008) gave a KL divergence interpretation of Bayesian model selection and derived generalizations of AIC and BIC when the model may be misspecified.

The work of AIC and BIC suggests a unified approach to model selection: choose a parameter vector $\theta$ that maximizes the penalized likelihood
$$\ell_n(\theta) - \lambda \|\theta\|_0, \qquad (2.1)$$
where the $L_0$-norm of $\theta$ counts the number of non-vanishing components of $\theta$ and $\lambda \ge 0$ is a regularization parameter. Given $\|\theta\|_0 = m$, the solution to (2.1) is the subset with the largest maximum likelihood among all subsets of size $m$. The model size $m$ is then chosen to maximize (2.1) among the $p$ best subsets of sizes $m$ ($1 \le m \le p$). Clearly, the computation of the penalized $L_0$ problem is a combinatorial problem with NP-complexity.
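To make the $L_0$ formulation concrete, here is a minimal sketch (ours, not from the paper) of best subset selection by exhaustive search under a centered Gaussian linear model with the error variance profiled out; the function name and these conventions are our own choices.

```python
import numpy as np
from itertools import combinations

def best_subset_l0(X, y, lam):
    """Exhaustive search maximizing the L0-penalized Gaussian log-likelihood (2.1).

    Assumes X and y are centered and the error variance is profiled out, so the
    log-likelihood of a subset S is -(n/2) * log(RSS_S / n) up to a constant.
    We therefore maximize -(n/2) * log(RSS_S / n) - lam * |S|; lam = 1 mimics
    AIC and lam = log(n)/2 mimics BIC.  The 2^p search is the combinatorial
    burden noted in the text, so this is only feasible for small p.
    """
    n, p = X.shape
    best_score, best_subset = -np.inf, ()
    for m in range(p + 1):
        for S in combinations(range(p), m):
            cols = list(S)
            resid = y if m == 0 else y - X[:, cols] @ np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
            rss = float(resid @ resid)
            score = -0.5 * n * np.log(rss / n) - lam * m
            if score > best_score:
                best_score, best_subset = score, S
    return best_subset, best_score
```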


When the normal likelihood is used, (2.1) becomes penalized least squares. Many traditional methods can be regarded as penalized likelihood methods with different choices of $\lambda$. Let $\mathrm{RSS}_d$ be the residual sum of squares of the best subset with $d$ variables. Then $C_p = \mathrm{RSS}_d/s^2 + 2d - n$ in Mallows (1973) corresponds to $\lambda = 1$, where $s^2$ is the mean squared error of the full model. The adjusted $R^2$, given by
$$R^2_{\mathrm{adj}} = 1 - \frac{n-1}{n-d}\,\frac{\mathrm{RSS}_d}{\mathrm{SST}},$$
also amounts to a penalized $L_0$ problem, where SST is the total sum of squares. Clearly, maximizing $R^2_{\mathrm{adj}}$ is equivalent to minimizing $\log(\mathrm{RSS}_d/(n-d))$. Since $\mathrm{RSS}_d/n \approx \sigma^2$ (the error variance), we have
$$n \log \frac{\mathrm{RSS}_d}{n-d} \approx \frac{\mathrm{RSS}_d}{\sigma^2} + d + n(\log \sigma^2 - 1).$$
This shows that the adjusted $R^2$ method is approximately equivalent to penalized maximum likelihood estimation (PMLE) with $\lambda = 1/2$. Other examples include the generalized cross-validation (GCV) given by $\mathrm{RSS}_d/(1 - d/n)^2$, cross-validation (CV), and RIC in Foster and George (1994). See Bickel and Li (2006) for more discussion of regularization in statistics.
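To make these criteria concrete, here is a small sketch (ours, not from the paper) computing Mallows' Cp, the adjusted R-squared, and GCV for one candidate subset under a centered linear model; the function name and the centering convention are our own choices.

```python
import numpy as np

def subset_criteria(X, y, subset, s2_full):
    """Classical criteria for one candidate subset with d = len(subset) < n.

    Computes Mallows' Cp = RSS_d / s^2 + 2d - n, the adjusted R^2
    = 1 - ((n - 1)/(n - d)) * RSS_d / SST, and GCV = RSS_d / (1 - d/n)^2,
    where s2_full is the mean squared error of the full model and X, y are
    assumed centered (no intercept term).
    """
    n = len(y)
    d = len(subset)
    cols = list(subset)
    beta = np.linalg.lstsq(X[:, cols], y, rcond=None)[0]
    rss = float(np.sum((y - X[:, cols] @ beta) ** 2))
    sst = float(np.sum(y ** 2))
    cp = rss / s2_full + 2 * d - n
    adj_r2 = 1 - (n - 1) / (n - d) * rss / sst
    gcv = rss / (1 - d / n) ** 2
    return cp, adj_r2, gcv
```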

3. Penalized Likelihood

As demonstrated above, $L_0$ regularization arises naturally in many classical model selection methods. It gives a nice interpretation of best subset selection and admits nice sampling properties (Barron, Birge and Massart (1999)). However, the computation is infeasible in high dimensional statistical endeavors. Other penalty functions should be used. This results in the generalized form
$$n^{-1}\ell_n(\beta) - \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (3.1)$$
where $\ell_n(\beta)$ is the log-likelihood function and $p_\lambda(\cdot)$ is a penalty function indexed by the regularization parameter $\lambda \ge 0$. By maximizing the penalized likelihood (3.1), we hope to simultaneously select variables and estimate their associated regression coefficients. In other words, those variables whose regression coefficients are estimated as zero are automatically deleted.

A natural generalization of penalized $L_0$-regression is penalized $L_q$-regression, called bridge regression in Frank and Friedman (1993), in which $p_\lambda(|\beta|) = \lambda|\beta|^q$ for $0 < q \le 2$. This bridges best subset selection (penalized $L_0$) and ridge regression (penalized $L_2$), including the $L_1$-penalty as a specific case. The non-negative garrote is introduced in Breiman (1995) for shrinkage estimation and variable selection. Penalized $L_1$-regression is called the LASSO by Tibshirani (1996) in the ordinary regression setting, and is now collectively referred to as penalized $L_1$-likelihood. Clearly, penalized $L_0$-regression possesses the variable selection feature, whereas penalized $L_2$-regression does not. What kinds of penalty functions are good for model selection?

Fan and Li (2001) advocate penalty functions that give estimators with three properties.

(1) Sparsity: The resulting estimator automatically sets small estimated coefficients to zero to accomplish variable selection and reduce model complexity.

(2) Unbiasedness: The resulting estimator is nearly unbiased, especially when the true coefficient $\beta_j$ is large, to reduce model bias.

(3) Continuity: The resulting estimator is continuous in the data to reduce instability in model prediction (Breiman (1996)).

They require the penalty function $p_\lambda(|\beta|)$ to be nondecreasing in $|\beta|$, and provide insights into these properties. We first consider penalized least squares in a canonical form.

3.1. Canonical regression model

Consider the linear regression model
$$y = X\beta + \varepsilon, \qquad (3.2)$$
where $X = (x_1, \ldots, x_n)^T$, $y = (y_1, \ldots, y_n)^T$, and $\varepsilon$ is an $n$-dimensional noise vector. If $\varepsilon \sim N(0, \sigma^2 I_n)$, then the penalized likelihood (3.1) is equivalent, up to an affine transformation of the log-likelihood, to the penalized least squares (PLS) problem
$$\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n}\|y - X\beta\|^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|) \Big\}, \qquad (3.3)$$
where $\|\cdot\|$ denotes the $L_2$-norm. Of course, penalized least squares continues to be applicable even when the noise does not follow a normal distribution.

For the canonical linear model in which the design matrix multiplied by $n^{-1/2}$ is orthonormal (i.e., $X^T X = nI_p$), (3.3) reduces to the minimization of
$$\frac{1}{2n}\|y - X\hat\beta\|^2 + \frac{1}{2}\|\hat\beta - \beta\|^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (3.4)$$
where $\hat\beta = n^{-1}X^T y$ is the ordinary least squares estimate. Minimizing (3.4) becomes a componentwise regression problem. This leads to considering the univariate PLS problem
$$\hat\theta(z) = \arg\min_{\theta \in \mathbb{R}} \Big\{ \tfrac{1}{2}(z - \theta)^2 + p_\lambda(|\theta|) \Big\}. \qquad (3.5)$$
Antoniadis and Fan (2001) show that the PLS estimator $\hat\theta(z)$ possesses the properties:

(1) sparsity if $\min_{t \ge 0}\{t + p_\lambda'(t)\} > 0$;

(2) approximate unbiasedness if $p_\lambda'(t) = 0$ for large $t$;

(3) continuity if and only if $\arg\min_{t \ge 0}\{t + p_\lambda'(t)\} = 0$,

where $p_\lambda(t)$ is nondecreasing and continuously differentiable on $[0, \infty)$, the function $-t - p_\lambda'(t)$ is strictly unimodal on $(0, \infty)$, and $p_\lambda'(t)$ means $p_\lambda'(0+)$ when $t = 0$ for notational simplicity. In general, the singularity of the penalty function at the origin (i.e., $p_\lambda'(0+) > 0$) is needed for generating sparsity in variable selection, and the concavity is needed to reduce the estimation bias.

3.2. Penalty function

It is known that the convex $L_q$ penalty with $q > 1$ does not satisfy the sparsity condition, whereas the convex $L_1$ penalty does not satisfy the unbiasedness condition, and the concave $L_q$ penalty with $0 \le q < 1$ does not satisfy the continuity condition. In other words, none of the $L_q$ penalties satisfies all three conditions simultaneously. For this reason, Fan (1997) and Fan and Li (2001) introduce the smoothly clipped absolute deviation (SCAD) penalty, whose derivative is given by
$$p_\lambda'(t) = \lambda\Big\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a-1)\lambda}\, I(t > \lambda) \Big\} \quad \text{for some } a > 2, \qquad (3.6)$$
where $p_\lambda(0) = 0$ and, often, $a = 3.7$ is used (suggested by a Bayesian argument). It satisfies the aforementioned three properties. A penalty of similar spirit is the minimax concave penalty (MCP) in Zhang (2009), whose derivative is given by
$$p_\lambda'(t) = \frac{(a\lambda - t)_+}{a}. \qquad (3.7)$$
Clearly, SCAD takes off at the origin like the $L_1$ penalty and then gradually levels off, and MCP translates the flat part of the derivative of SCAD to the origin. When
$$p_\lambda(t) = \frac{1}{2}\big[\lambda^2 - (\lambda - t)_+^2\big], \qquad (3.8)$$
Antoniadis (1996) shows that the solution is the hard-thresholding estimator $\hat\theta_H(z) = z\,I(|z| > \lambda)$. A family of concave penalties that bridge the $L_0$ and $L_1$ penalties was studied by Lv and Fan (2009) for model selection and sparse recovery. A linear combination of the $L_1$ and $L_2$ penalties is called an elastic net by Zou and Hastie (2005), which encourages some grouping effects. Figure 2 depicts some of these commonly used penalty functions.

Figure 2. Some commonly used penalty functions (left panel) and their derivatives (right panel). They correspond to the risk functions shown in the right panel of Figure 3. More precisely, $\lambda = 2$ for the hard thresholding penalty, $\lambda = 1.04$ for the $L_1$-penalty, $\lambda = 1.02$ for SCAD with $a = 3.7$, and $\lambda = 1.49$ for MCP with $a = 2$.

Figure 3. The risk functions for penalized least squares under the Gaussian model for the hard thresholding penalty, $L_1$-penalty, SCAD ($a = 3.7$), and MCP ($a = 2$). The left panel corresponds to $\lambda = 1$ and the right panel to $\lambda = 2$ for the hard thresholding estimator, with the remaining parameters chosen so that their risks are the same at the point $\theta = 3$.
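For reference, the SCAD and MCP derivatives (3.6) and (3.7) can be coded directly. This is a minimal sketch of ours; the function names are hypothetical.

```python
import numpy as np

def scad_deriv(t, lam, a=3.7):
    """SCAD derivative p'_lambda(t) in (3.6), for t >= 0 and a > 2."""
    t = np.asarray(t, dtype=float)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def mcp_deriv(t, lam, a=2.0):
    """MCP derivative p'_lambda(t) in (3.7), for t >= 0."""
    t = np.asarray(t, dtype=float)
    return np.maximum(a * lam - t, 0.0) / a

# Both start at lam near the origin (inducing sparsity) and vanish for large t
# (reducing bias): scad_deriv(0.0, 1.0) == 1.0 while scad_deriv(10.0, 1.0) == 0.0.
```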

We now look at the PLS estimator $\hat\theta(z)$ in (3.5) for a few penalties. Each increasing penalty function gives a shrinkage rule: $|\hat\theta(z)| \le |z|$ and $\hat\theta(z) = \mathrm{sgn}(z)|\hat\theta(z)|$ (Antoniadis and Fan (2001)). The entropy penalty ($L_0$ penalty) and the hard thresholding penalty yield the hard thresholding rule (Donoho and Johnstone (1994)), while the $L_1$ penalty gives the soft thresholding rule (Bickel (1983) and Donoho and Johnstone (1994)). The SCAD and MCP give rise to analytical solutions to (3.5), each of which is a linear spline in $z$ (Fan (1997)).
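For concreteness, the following sketch (ours; the SCAD rule uses the standard closed-form solution from Fan and Li (2001), and the function names are hypothetical) implements these thresholding rules for the univariate problem (3.5).

```python
import numpy as np

def hard_threshold(z, lam):
    """Hard thresholding rule: keep z if |z| > lam, else 0."""
    return z * (np.abs(z) > lam)

def soft_threshold(z, lam):
    """Soft thresholding rule from the L1 penalty: sgn(z)(|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """Closed-form SCAD solution to (3.5): soft thresholding near zero, a linear
    spline in between, and the identity (no shrinkage) for |z| > a * lam."""
    z = np.asarray(z, dtype=float)
    az = np.abs(z)
    return np.where(az <= 2 * lam,
                    np.sign(z) * np.maximum(az - lam, 0.0),
                    np.where(az <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))
```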

How do these thresholded-shrinkage estimators perform? To compare them, we compute their risks in the fundamental model in which $Z \sim N(\theta, 1)$. Let $R(\theta) = E(\hat\theta(Z) - \theta)^2$. Figure 3 shows the risk functions $R(\theta)$ for some commonly used penalty functions. To make them comparable, we chose $\lambda = 1$ and $2$ for the hard thresholding penalty, and for the other penalty functions the values of $\lambda$ were chosen to make their risks at $\theta = 3$ the same. Clearly, the penalized likelihood estimators improve on the ordinary least squares estimator $Z$ in the region where $\theta$ is near zero, and have the same risk as the ordinary least squares estimator when $\theta$ is far away from zero (e.g., 4 standard deviations away), except for the LASSO estimator. When $\theta$ is large, the LASSO estimator has a bias approximately of size $\lambda$, and this causes a higher risk, as shown in Figure 3. When $\lambda_{\mathrm{hard}} = 2$, the LASSO estimator has a higher risk than the SCAD estimator, except in a small region. The bias of the LASSO estimator makes LASSO prefer a smaller $\lambda$. For $\lambda_{\mathrm{hard}} = 1$, the advantage of the LASSO estimator around zero is more pronounced. As a result, in model selection, when $\lambda$ is automatically selected by a data-driven rule to compensate for the bias problem, the LASSO estimator has to choose a smaller $\lambda$ in order to have a desired mean squared error. Yet a smaller value of $\lambda$ results in a more complex model. This explains why the LASSO estimator tends to have many false positive variables in the selected model.

3.3. Computation and implementation

It is challenging to solve the penalized likelihood problem (3.1) when the penalty function $p_\lambda$ is nonconvex. Nevertheless, Fan and Lv (2009) are able to give conditions under which the penalized likelihood estimator exists and is unique; see also Kim and Kwon (2009) for results on penalized least squares with the SCAD penalty. When the $L_1$-penalty is used, the objective function (3.1) is concave and hence convex optimization algorithms can be applied. We show in this section that the penalized likelihood (3.1) can be solved by a sequence of reweighted penalized $L_1$-regression problems via local linear approximation (Zou and Li (2008)).

In the absence of other available algorithms at that time, Fan and Li (2001) propose a unified and effective local quadratic approximation (LQA) algorithm for optimizing nonconcave penalized likelihood. Their idea is to locally approximate the objective function by a quadratic function. Specifically, for a given initial value $\beta^* = (\beta_1^*, \ldots, \beta_p^*)^T$, the penalty function $p_\lambda$ can be locally approximated by a quadratic function as
$$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^*|) + \frac{1}{2}\, \frac{p_\lambda'(|\beta_j^*|)}{|\beta_j^*|}\, \big[\beta_j^2 - (\beta_j^*)^2\big] \quad \text{for } \beta_j \approx \beta_j^*. \qquad (3.9)$$
With this and an LQA to the log-likelihood, the penalized likelihood (3.1) becomes a least squares problem that admits a closed-form solution. To avoid numerical instability, the algorithm sets the estimated coefficient $\hat\beta_j = 0$ if $\beta_j^*$ is very close to 0, which amounts to deleting the $j$-th covariate from the final model. Clearly, the value 0 is an absorbing state of LQA in the sense that once a coefficient is set to zero, it remains zero in subsequent iterations.

Figure 4. The local linear (dashed) and local quadratic (dotted) approximations to the SCAD function (solid) with $\lambda = 2$ and $a = 3.7$ at the given point $|\beta| = 4$.

The convergence property of the LQA was studied in Hunter and Li (2005), who show that LQA plays the same role as the E-step in the EM algorithm of Dempster, Laird and Rubin (1977). Therefore, LQA has behavior similar to EM. Although the EM algorithm requires a full iteration for maximization after each E-step, the LQA updates the quadratic approximation at each step during the course of iteration, which speeds up the convergence of the algorithm. The convergence rate of LQA is quadratic, which is the same as that of the modified EM algorithm in Lange (1995).

A better approximation can be achieved by using the local linear approximation (LLA)
$$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^*|) + p_\lambda'(|\beta_j^*|)\,(|\beta_j| - |\beta_j^*|) \quad \text{for } \beta_j \approx \beta_j^*, \qquad (3.10)$$
as in Zou and Li (2008). See Figure 4 for an illustration of the local linear and local quadratic approximations to the SCAD function. Clearly, both LLA and LQA are convex majorants of the concave penalty function $p_\lambda(\cdot)$ on $[0, \infty)$, but LLA is a better approximation since it is the minimum (tightest) convex majorant of the concave function on $[0, \infty)$. With LLA, the penalized likelihood (3.1) becomes
$$n^{-1}\ell_n(\beta) - \sum_{j=1}^{p} w_j |\beta_j|, \qquad (3.11)$$
where the weights are $w_j = p_\lambda'(|\beta_j^*|)$. Problem (3.11) is a concave optimization problem if the log-likelihood function is concave. Different penalty functions give different weighting schemes, and the LASSO gives a constant weighting scheme. In this sense, nonconcave penalized likelihood is an iteratively reweighted penalized $L_1$ regression. The weight function is chosen adaptively to reduce the biases due to penalization. For example, for SCAD and MCP, when the estimate of a particular component is large, so that there is high confidence that it is non-vanishing, the component does not receive any penalty in (3.11), as desired.
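As an illustration of this reweighting idea for the least squares case, here is a minimal sketch of ours (the inner proximal-gradient solver and all names are our own choices, not the algorithm of Zou and Li (2008)): the outer loop recomputes the SCAD weights $w_j$, and the inner loop solves the weighted-$L_1$ problem.

```python
import numpy as np

def lla_penalized_ls(X, y, lam, a=3.7, n_outer=3, n_inner=200):
    """One possible LLA scheme (3.10)-(3.11) for SCAD-penalized least squares.

    Outer loop: recompute weights w_j = p'_lambda(|beta_j|) from the current
    estimate (SCAD derivative).  Inner loop: solve the weighted-L1 problem
    (1/2n)||y - X b||^2 + sum_j w_j |b_j| by proximal gradient (ISTA).
    Starting from beta = 0 gives w_j = lam, i.e. the LASSO, as noted in the text.
    """
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant

    def scad_deriv(t):
        return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

    for _ in range(n_outer):
        w = scad_deriv(np.abs(beta))                      # LLA weights
        for _ in range(n_inner):
            grad = -X.T @ (y - X @ beta) / n
            z = beta - step * grad
            beta = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)  # weighted soft threshold
    return beta
```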

Zou (2006) proposes the weighting scheme $w_j = \lambda|\beta_j^*|^{-\gamma}$ for some $\gamma > 0$, and calls the resulting procedure the adaptive LASSO. This weight reduces the penalty when the previous estimate is large. However, the penalty at zero is infinite. When the procedure is applied iteratively, zero becomes an absorbing state. On the other hand, penalty functions such as SCAD and MCP do not have this undesired property. For example, if the initial estimate is zero, then $w_j = \lambda$ and the resulting estimate is the LASSO estimate.

Fan and Li (2001), Zou (2006), and Zou and Li (2008) all suggest using a consistent estimate, such as the un-penalized MLE, as the initial value. This implicitly assumes that $p \ll n$. For dimensionality $p$ larger than the sample size $n$, the above method is not applicable. Fan and Lv (2008) recommend using $\beta_j^* = 0$, which is equivalent to using the LASSO estimate as the initial estimate. Another possible initial value is a stepwise addition fit or componentwise regression. They put forward the recommendation that only a few iterations are needed, which is in line with Zou and Li (2008).

Before we close this section, we remark that with the LLA and LQA, the resulting sequence of target values is always nondecreasing, which is a specific feature of minorization-maximization (MM) algorithms (Hunter and Lange (2000)). Let $p_\lambda(\beta) = \sum_{j=1}^{p} p_\lambda(|\beta_j|)$. Suppose that at the $k$-th iteration, $p_\lambda(\beta)$ is approximated by $q(\beta)$ such that
$$p_\lambda(\beta) \le q(\beta) \quad \text{and} \quad p_\lambda(\beta^{(k)}) = q(\beta^{(k)}), \qquad (3.12)$$
where $\beta^{(k)}$ is the estimate at the $k$-th iteration. Let $\beta^{(k+1)}$ maximize the approximated penalized likelihood $n^{-1}\ell_n(\beta) - q(\beta)$. Then we have
$$n^{-1}\ell_n(\beta^{(k+1)}) - p_\lambda(\beta^{(k+1)}) \ge n^{-1}\ell_n(\beta^{(k+1)}) - q(\beta^{(k+1)}) \ge n^{-1}\ell_n(\beta^{(k)}) - q(\beta^{(k)}) = n^{-1}\ell_n(\beta^{(k)}) - p_\lambda(\beta^{(k)}).$$
Thus, the target values are non-decreasing. Clearly, the LLA and LQA are two specific cases of MM algorithms, satisfying condition (3.12); see Figure 4. Therefore, the sequence of target function values is non-decreasing and thus converges provided it is bounded. The critical point is the global maximizer under the conditions in Fan and Lv (2009).

3.4. LARS and other algorithms

As demonstrated in the previous section, the penalized least squares problem (3.3) with an $L_1$ penalty is fundamental to the computation of penalized likelihood estimation. There are several additional powerful algorithms for such an endeavor. Osborne, Presnell and Turlach (2000) cast such a problem as a quadratic programming problem. Efron, Hastie, Johnstone and Tibshirani (2004) propose a fast and efficient least angle regression (LARS) algorithm for variable selection, a simple modification of which produces the entire LASSO solution path $\{\hat\beta(\lambda) : \lambda > 0\}$ that optimizes (3.3). The computation is based on the fact that the LASSO solution path is piecewise linear in $\lambda$. See Rosset and Zhu (2007) for a more general account of the conditions under which the solution to the penalized likelihood (3.1) is piecewise linear. The LARS algorithm starts from a large value of $\lambda$ that selects only one covariate, the one with the greatest correlation with the response variable, and decreases $\lambda$ until a second variable is selected, at which point the selected variables have the same correlation (in magnitude) with the current working residual as the first one, and so on. See Efron et al. (2004) for details.

The idea of the LARS algorithm can be expanded to compute the solution paths of the penalized least squares (3.3). Zhang (2009) introduces the PLUS algorithm for efficiently computing a solution path of (3.3) when the penalty function $p_\lambda(\cdot)$ is a quadratic spline, such as SCAD and MCP. In addition, Zhang (2009) shows that the solution path $\hat\beta(\lambda)$ is piecewise linear in $\lambda$, and that the proposed solution path has the desired statistical properties.

For the penalized least squares problem (3.3), Fu (1998), Daubechies, Defrise and De Mol (2004), and Wu and Lange (2008) propose a coordinate descent algorithm, which iteratively optimizes (3.3) one component at a time. This algorithm can also be applied to optimize the group LASSO (Antoniadis and Fan (2001) and Yuan and Lin (2006)), as shown in Meier, van de Geer and Bühlmann (2008), penalized precision matrix estimation (Friedman, Hastie and Tibshirani (2007)), and the penalized likelihood (3.1) (Fan and Lv (2009) and Zhang and Li (2009)).

More specifically, Fan and Lv (2009) employ a path-following coordinate optimization algorithm, called the iterative coordinate ascent (ICA) algorithm, for maximizing the nonconcave penalized likelihood. It successively maximizes the penalized likelihood (3.1) for regularization parameters $\lambda$ in decreasing order. A similar idea is also studied in Zhang and Li (2009), who introduce the ICM algorithm. The coordinate optimization algorithm uses the Gauss-Seidel method, i.e., maximizing one coordinate at a time with successive displacements. Specifically, for each coordinate within each iteration, it uses the second order approximation of $\ell_n(\beta)$ at the $p$-vector from the previous step along that coordinate and maximizes the univariate penalized quadratic approximation
$$\max_{\theta \in \mathbb{R}} \Big\{ -\frac{\gamma}{2}(z - \theta)^2 - p_\lambda(|\theta|) \Big\}, \qquad (3.13)$$
where $\gamma > 0$. It updates each coordinate if the maximizer of the corresponding univariate penalized quadratic approximation makes the penalized likelihood (3.1) strictly increase. Therefore, the ICA algorithm enjoys the ascent property that the resulting sequence of values of the penalized likelihood is increasing for a fixed $\lambda$. Compared to other algorithms, the coordinate optimization algorithm is especially appealing for large scale problems with both $n$ and $p$ large, thanks to its low computational complexity. It is fast to implement when the univariate problem (3.13) admits a closed-form solution, which is the case for many commonly used penalty functions such as SCAD and MCP. In practical implementation, we pick a sufficiently large $\lambda_{\max}$ such that the maximizer of the penalized likelihood (3.1) with $\lambda = \lambda_{\max}$ is 0, and a decreasing sequence of regularization parameters. The studies in Fan and Lv (2009) show that coordinate optimization works equally well and efficiently for producing the entire solution paths for concave penalties.
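In the spirit of the path-following coordinate optimization described above, here is a simpler sketch of ours for SCAD-penalized least squares with standardized columns (not the ICA algorithm of Fan and Lv (2009), which targets general penalized likelihood); each coordinate update is the closed-form SCAD rule, and decreasing lambdas provide warm starts along the path.

```python
import numpy as np

def scad_threshold(z, lam, a=3.7):
    """Closed-form minimizer of (1/2)(z - t)^2 + p_lambda(|t|) for the SCAD
    penalty (repeated here from the thresholding sketch for self-containment)."""
    az = abs(z)
    if az <= 2 * lam:
        return np.sign(z) * max(az - lam, 0.0)
    if az <= a * lam:
        return ((a - 1) * z - np.sign(z) * a * lam) / (a - 2)
    return z

def scad_cd_path(X, y, lambdas, a=3.7, n_sweeps=50):
    """Path-following coordinate descent for SCAD-penalized least squares.

    Assumes the columns of X are standardized so that (1/n)||X_j||^2 = 1; then
    each coordinate update is exactly the univariate problem (3.13) with
    gamma = 1 and has the closed form above.  `lambdas` should be decreasing so
    that each solution warm-starts the next, mimicking the path-following idea.
    """
    n, p = X.shape
    beta = np.zeros(p)
    path = []
    for lam in lambdas:
        for _ in range(n_sweeps):
            for j in range(p):
                r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual
                z_j = X[:, j] @ r_j / n
                beta[j] = scad_threshold(z_j, lam, a)
        path.append(beta.copy())
    return np.array(path)
```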

The LLA algorithm for computing the penalized likelihood estimator is now available in R at

http://cran.r-project.org/web/packages/SIS/index.html

as a function in the SIS package, as is the PLUS algorithm for computing the penalized least squares estimator with the SCAD and MC+ penalties. Matlab code for the ICA algorithm, for computing the solution path of the penalized likelihood estimator, and for computing SIS is also available upon request.

3.5. Composite quasi-likelihood

The function $\ell_n(\beta)$ in (3.1) does not have to be the true likelihood. It can be a quasi-likelihood or a loss function (Fan, Samworth and Wu (2009)). In most statistical applications, it is of the form
$$n^{-1}\sum_{i=1}^{n} Q(x_i^T\beta, y_i) - \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (3.14)$$
where $Q(x_i^T\beta, y_i)$ is the conditional quasi-likelihood of $Y_i$ given $X_i$. It can also be the loss function of using $x_i^T\beta$ to predict $y_i$. In this case, the penalized quasi-likelihood (3.14) is written as the minimization of
$$n^{-1}\sum_{i=1}^{n} L(x_i^T\beta, y_i) + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (3.15)$$
where $L$ is a loss function. For example, the loss function can be a robust loss: $L(x_i^T\beta, y_i) = |y_i - x_i^T\beta|$. How should we choose a quasi-likelihood to enhance the efficiency of the procedure when the error distribution possibly deviates from normality?

To illustrate the idea, consider the linear model (3.2). As long as the error distribution of $\varepsilon$ is homoscedastic, $x_i^T\beta$ is, up to an additive constant, the conditional $\tau$-quantile of $y_i$ given $x_i$. Therefore, $\beta$ can be estimated by the quantile regression
$$\sum_{i=1}^{n} \rho_\tau(y_i - b - x_i^T\beta),$$
where $\rho_\tau(x) = \tau x_+ + (1-\tau)x_-$ (Koenker and Bassett (1978)). Koenker (1984) proposes solving the weighted composite quantile regression, which uses different quantiles to improve the efficiency, namely, minimizing with respect to $b_1, \ldots, b_K$ and $\beta$,
$$\sum_{k=1}^{K} w_k \sum_{i=1}^{n} \rho_{\tau_k}(y_i - b_k - x_i^T\beta), \qquad (3.16)$$
where $\{\tau_k\}$ is a given sequence of quantiles and $\{w_k\}$ is a given sequence of weights. Zou and Yuan (2008) propose the penalized composite quantile with equal weights to improve the efficiency of penalized least squares.
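A small sketch of ours of the check loss and the composite objective in (3.16); the function names are hypothetical, and the objective is written for use with a generic numerical optimizer (e.g. scipy.optimize.minimize).

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = tau * u_+ + (1 - tau) * u_-."""
    return tau * np.maximum(u, 0.0) + (1.0 - tau) * np.maximum(-u, 0.0)

def composite_quantile_objective(beta, b, X, y, taus, weights):
    """Weighted composite quantile objective (3.16).

    `b` holds one intercept b_k per quantile level tau_k; `weights` are the w_k.
    Minimizing this over (b, beta) gives the composite quantile regression fit.
    """
    total = 0.0
    for tau_k, w_k, b_k in zip(taus, weights, b):
        total += w_k * np.sum(check_loss(y - b_k - X @ beta, tau_k))
    return total
```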

Recently, Bradic, Fan and Wang (2009) proposed the more general composite quasi-likelihood
$$\sum_{k=1}^{K} w_k \sum_{i=1}^{n} L_k(x_i^T\beta, y_i) + \sum_{j=1}^{p} p_\lambda(|\beta_j|). \qquad (3.17)$$
They derive the asymptotic normality of the estimator and choose the weight function to optimize the asymptotic variance. In this view, it always performs better than a single quasi-likelihood function. In particular, they study in detail the relative efficiency of the composite $L_1$-$L_2$ loss and of the optimal composite quantile loss relative to the least squares estimator.

Note that the composite likelihood (3.17) can be regarded as an approximation to the log-likelihood function via
$$-\log f(y|x) = -\log f(y|x^T\beta) \approx \sum_{k=1}^{K} w_k L_k(x^T\beta, y),$$
with $\sum_{k=1}^{K} w_k = 1$. Hence, the $w_k$ can also be chosen to minimize (3.17) directly. If the convexity of the composite likelihood is enforced, we need to impose the additional constraint that all weights are non-negative.

3.6. Choice of penalty parameters

The choice of penalty parameters is of paramount importance in penalized likelihood estimation. When $\lambda = 0$, all variables are selected, and the model is even unidentifiable when $p > n$. When $\lambda = \infty$, if the penalty satisfies $\lim_{\lambda \to \infty} p_\lambda(|\beta|) = \infty$ for $\beta \neq 0$, then none of the variables is selected. The interesting cases lie between these two extreme choices.

The above discussion clearly indicates that $\lambda$ governs the complexity of the selected model. A large value of $\lambda$ tends to choose a simple model, whereas a small value of $\lambda$ inclines toward a complex model. Estimation using a larger value of $\lambda$ tends to have smaller variance, whereas estimation using a smaller value of $\lambda$ inclines toward smaller modeling biases. The trade-off between the biases and variances yields an optimal choice of $\lambda$. This is frequently done by using multi-fold cross-validation.
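A minimal sketch of ours of such multi-fold cross-validation; `fit` stands for any penalized-regression solver (e.g. one of the sketches above), and the names and the squared-error validation criterion are our own choices.

```python
import numpy as np

def cv_choose_lambda(X, y, lambdas, fit, n_folds=5, seed=0):
    """Multi-fold cross-validation for the regularization parameter lambda.

    `fit(X_train, y_train, lam)` returns a coefficient vector.  Returns the
    lambda with the smallest total validation prediction error, together with
    the per-lambda average error.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    cv_err = np.zeros(len(lambdas))
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        for i, lam in enumerate(lambdas):
            beta = fit(X[train_idx], y[train_idx], lam)
            resid = y[val_idx] - X[val_idx] @ beta
            cv_err[i] += np.sum(resid ** 2)
    return lambdas[int(np.argmin(cv_err))], cv_err / n
```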

There are relatively few studies on the choice of penalty parameters. In Wang, Li and Tsai (2007), it is shown that the model selected by generalized cross-validation using the SCAD penalty contains all important variables, but with nonzero probability includes some unimportant variables, and that the model selected by using BIC achieves model selection consistency and an oracle property. It is worth pointing out that missing some true predictors causes model misspecification, as does misspecifying the family of distributions. A semi-Bayesian information criterion (SIC) is proposed by Lv and Liu (2008) to address this issue for model selection.

4. Ultra-High Dimensional Variable Selection

Variable selection in ultra-high dimensional feature space has become increasingly important in statistics, and calls for new or extended statistical methodologies and theory. For example, in disease classification using microarray gene expression data, the number of arrays is usually on the order of tens while the number of gene expression profiles is on the order of tens of thousands; in the study of protein-protein interactions, the number of features can be on the order of millions while the sample size $n$ can be on the order of thousands (see, e.g., Tibshirani, Hastie, Narasimhan and Chu (2003) and Fan and Ren (2006)); the same order of magnitude occurs in genetic association studies between genotypes and phenotypes. In such problems, it is important to identify significant features (e.g., SNPs) contributing to the response and to reliably predict certain clinical prognoses (e.g., survival time and cholesterol level). As mentioned in the Introduction, three important issues arise in such high dimensional statistical endeavors: computational cost, statistical accuracy, and model interpretability. Existing variable selection techniques can become computationally intensive in ultra-high dimensions.

A natural idea is to reduce the dimensionality $p$ from a large or huge scale (say, $\log p = O(n^a)$ for some $a > 0$) to a relatively large scale $d$ (e.g., $O(n^b)$ for some $b > 0$) by a fast, reliable, and efficient method, so that well-developed variable selection techniques can be applied to the reduced feature space. This provides a powerful tool for variable selection in ultra-high dimensional feature space. It addresses the aforementioned three issues when the variable screening procedures are capable of retaining all the important variables with asymptotic probability one, the sure screening property introduced in Fan and Lv (2008).

The above discussion already suggests a two-scale method for ultra-high dimensional variable selection problems: a crude large scale screening followed by a moderate scale selection. The idea is explicitly suggested by Fan and Lv (2008) and is illustrated by the schematic diagram in Figure 5. One can choose any of many popular screening techniques, as long as it possesses the sure screening property. In the same vein, one can also select a preferred tool for the moderate scale selection. The large-scale screening and moderate-scale selection can be applied iteratively, resulting in iterative sure independence screening (ISIS) (Fan and Lv (2008)). Its amelioration and extensions are given in Fan, Samworth and Wu (2009), who also develop R and Matlab codes to facilitate the implementation in generalized linear models (McCullagh and Nelder (1989)).

4.1. Sure independence screening

Independence screening refers to ranking features according to a marginal utility, namely, each feature is used independently as a predictor to decide its usefulness for predicting the response. Sure independence screening (SIS) was introduced by Fan and Lv (2008) to reduce the computation in ultra-high dimensional variable selection: all important features are in the selected model with probability tending to 1 (Fan and Lv (2008)). An example of independence learning is the correlation ranking proposed in Fan and Lv (2008), which ranks features according to the magnitude of their sample correlations with the response variable. More precisely, let $\omega = (\omega_1, \ldots, \omega_p)^T = X^T y$ be a $p$-vector obtained by componentwise regression, where we assume that each column of the $n \times p$ design matrix $X$ has been standardized with mean zero and variance one. For any given $d_n$, take the selected submodel to be
$$\widehat{\mathcal{M}}_d = \{1 \le j \le p : |\omega_j| \text{ is among the first } d_n \text{ largest of all}\}. \qquad (4.1)$$
This reduces the full model of size $p \gg n$ to a submodel of size $d_n$, which can be less than $n$. Such correlation learning screens out those variables that have weak marginal correlations with the response. For classification problems with $Y = \pm 1$, the correlation ranking reduces to selecting features by using two-sample $t$-test statistics; see Section 4.2 for additional details.

Figure 5. Illustration of the ultra-high dimensional variable selection scheme. A large scale screening is first used to screen out unimportant variables, and then a moderate-scale search is applied to further select important variables. At both steps, one can choose a favorite method.
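A minimal sketch of ours of the correlation screening step (4.1); the function name and the internal standardization of X and centering of y are our own conventions.

```python
import numpy as np

def sis(X, y, d):
    """Sure independence screening by correlation ranking, as in (4.1).

    Columns of X are standardized to mean zero and unit variance, so
    omega = X^T y is proportional to the vector of marginal sample correlations
    with the response.  Returns the indices of the d features with the largest
    |omega_j|.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    omega = Xs.T @ (y - y.mean())
    return np.argsort(-np.abs(omega))[:d]

# A conservative choice discussed in the text: d = n - 1 or d = int(n / np.log(n)),
# followed by a penalized likelihood method (e.g. SCAD) on the screened features.
```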

Other examples of independence learning include methods in microarray data analysis where a two-sample test is used to select significant genes between the treatment and control groups (Dudoit et al. (2003), Storey and Tibshirani (2003), Fan and Ren (2006), and Efron (2007)), feature ranking using a generalized correlation (Hall and Miller (2009a)), nonparametric learning under sparse additive models (Ravikumar, Lafferty, Liu and Wasserman (2009)), and the method in Huang, Horowitz and Ma (2008) that uses marginal bridge estimators for selecting variables in high dimensional sparse regression models. Hall, Titterington and Xue (2009) derive some independence learning rules using tilting methods and empirical likelihood, and propose a bootstrap method to assess the fidelity of feature ranking. In particular, the false discovery rate (FDR) proposed by Benjamini and Hochberg (1995) is popularly used in multiple testing for controlling the expected false positive rate. See also Efron, Tibshirani, Storey and Tusher (2001), Abramovich, Benjamini, Donoho and Johnstone (2006), Donoho and Jin (2006), and Clarke and Hall (2009).

We now discuss the sure screening property of correlation screening. Let $\mathcal{M} = \{1 \le j \le p : \beta_j \neq 0\}$ be the true underlying sparse model with nonsparsity size $s = |\mathcal{M}|$; the other $p - s$ variables can also be correlated with the response variable via the link to the predictors in the true model. Fan and Lv (2008) consider the case $p \gg n$ with $\log p = O(n^a)$ for some $a \in (0, 1 - 2\kappa)$, where $\kappa$ is specified below, and Gaussian noise $\varepsilon \sim N(0, \sigma^2)$ for some $\sigma > 0$. They assume that $\mathrm{Var}(Y) = O(1)$, $\lambda_{\max}(\Sigma) = O(n^{\tau})$,
$$\min_{j \in \mathcal{M}} |\beta_j| \ge \frac{c}{n^{\kappa}} \quad \text{and} \quad \min_{j \in \mathcal{M}} |\mathrm{Cov}(\beta_j^{-1} Y, X_j)| \ge c,$$
where $\Sigma = \mathrm{Cov}(\mathbf{x})$, $\kappa, \tau \ge 0$, $c$ is a positive constant, and the $p$-dimensional covariate vector $\mathbf{x}$ has an elliptical distribution with the random matrix $X\Sigma^{-1/2}$ having a concentration property that holds for Gaussian distributions. For studies on the extreme eigenvalues and limiting spectral distributions of large random matrices, see, e.g., Silverstein (1985), Bai and Yin (1993), Bai (1999), Johnstone (2001), and Ledoux (2001, 2005).

Under the above regularity conditions, Fan and Lv (2008) show that if $2\kappa + \tau < 1$, then there exists some $\theta \in (2\kappa + \tau, 1)$ such that when $d_n \sim n^{\theta}$, we have, for some $C > 0$,
$$P(\mathcal{M} \subset \widehat{\mathcal{M}}_d) = 1 - O\big(\exp(-Cn^{1-2\kappa}/\log n)\big). \qquad (4.2)$$
In particular, this sure screening property entails the sparsity of the model: $s \le d_n$. It demonstrates that SIS can reduce exponentially high dimensionality to a relatively large scale $d_n$ below the sample size $n$, while the reduced model $\widehat{\mathcal{M}}_d$ still contains all the important variables with an overwhelming probability. In practice, to be conservative we can choose $d = n - 1$ or $[n/\log n]$. Of course, one can also take a final model size $d \ge n$. Clearly, a larger $d$ means a larger probability of including the true underlying sparse model $\mathcal{M}$ in the final model $\widehat{\mathcal{M}}_d$. See Section 4.3 for further results on sure independence screening.

When the dimensionality is reduced from a large scale $p$ to a moderate scale $d$ by applying a sure screening method such as correlation learning, well-developed variable selection techniques, such as penalized least squares methods, can be applied to the reduced feature space. This is a powerful tool of SIS-based variable selection methods. The sampling properties of these methods can be easily obtained by combining the theory of SIS and penalization methods.

4.2. Feature selection for classification

Independence learning has also been widely used for feature selection in high dimensional classification problems. In this section we look at the specific setting of classification and continue the topic of independence learning for variable selection in Section 4.3. Consider the $p$-dimensional classification problem between two classes. For $k \in \{1, 2\}$, let $\mathbf{X}_{k1}, \ldots, \mathbf{X}_{kn_k}$ be i.i.d. $p$-dimensional observations from the $k$-th class. Classification aims at finding a discriminant function $\delta(\mathbf{x})$ that classifies new observations as accurately as possible. The classifier $\delta(\cdot)$ assigns $\mathbf{x}$ to class 1 if $\delta(\mathbf{x}) \ge 0$ and to class 2 otherwise.

Many classification methods have been proposed in the literature. The best classifier is the Fisher discriminant $\delta_F(\mathbf{x}) = (\mathbf{x} - \mu)^T \Sigma^{-1}(\mu_1 - \mu_2)$ when the data are from a normal distribution with a common covariance matrix, $\mathbf{X}_{ki} \sim N(\mu_k, \Sigma)$ for $k = 1, 2$, where $\mu = (\mu_1 + \mu_2)/2$. However, this method is hard to implement when the dimensionality is high due to the difficulty of estimating the unknown covariance matrix $\Sigma$. Hence, the independence rule, which involves estimating only the diagonal entries of the covariance matrix, with discriminant function $\delta(\mathbf{x}) = (\mathbf{x} - \mu)^T D^{-1}(\mu_1 - \mu_2)$, is frequently employed for classification, where $D = \mathrm{diag}\{\Sigma\}$. For a survey of recent developments, see Fan, Fan and Wu (2010).

Classical methods break down when the dimensionality is high. As demonstrated by Bickel and Levina (2004), the Fisher discrimination method no longer performs well in high dimensional settings due to the diverging spectra and the singularity of the sample covariance matrix. They show that the independence rule overcomes these problems and outperforms the Fisher discriminant in the high dimensional setting. However, in practical implementations such as tumor classification using microarray data, one hopes to find tens of genes that have high discriminative power. The independence rule does not possess the feature selection property.

The noise accumulation phenomenon is well known in the regression setup, but it had never been quantified in the classification problem until Fan and Fan (2008). They show that the difficulty of high dimensional classification is intrinsically caused by the existence of many noise features that do not contribute to the reduction of classification error. For example, in linear discriminant analysis one needs to estimate the class mean vectors and the covariance matrix. Although each parameter can be estimated accurately, the aggregated estimation error can be very large, and this can significantly increase the misclassification rate.

Let $R_0$ be the common correlation matrix, $\lambda_{\max}(R_0)$ be its largest eigenvalue, and $\alpha = \mu_1 - \mu_2$. Consider the parameter space
$$\Gamma = \Big\{(\alpha, \Sigma) : \alpha^T D^{-1}\alpha \ge C_p,\ \lambda_{\max}(R_0) \le b_0,\ \min_{1 \le j \le p}\sigma_j^2 > 0\Big\},$$
where $C_p$ and $b_0$ are given constants, and $\sigma_j^2$ is the $j$-th diagonal element of $\Sigma$. Note that $C_p$ measures the strength of the signals. Let $\hat\delta$ be the estimated discriminant function of the independence rule, obtained by plugging in the sample estimates of $\alpha$ and $D$. If $\sqrt{n_1 n_2/(np)}\, C_p \to D_0 \ge 0$, Fan and Fan (2008) demonstrate that the worst case classification error, $W(\hat\delta)$, over the parameter space $\Gamma$ converges:
$$W(\hat\delta) \stackrel{P}{\longrightarrow} 1 - \Phi\Big(\frac{D_0}{2\sqrt{b_0}}\Big), \qquad (4.3)$$
where $n = n_1 + n_2$ and $\Phi(\cdot)$ is the cumulative distribution function of the standard normal random variable.

The misclassification rate (4.3) relates to dimensionality through the term $D_0$, which depends on $C_p/\sqrt{p}$. This quantifies the tradeoff between the dimensionality $p$ and the overall signal strength $C_p$. The signal $C_p$ always increases with dimensionality. If the useful features are located in the first $s$ components, say, then the signals stop increasing when more than $s$ features are used, yet the penalty of using all features is $\sqrt{p}$. Clearly, using $s$ features can perform much better than using all $p$ features. The optimal number of features should be the one that maximizes $C_m/\sqrt{m}$, where $C_m$ is the signal of the best subset $S$ of $m$ features, defined as $\alpha_S^T D_S^{-1}\alpha_S$, with $\alpha_S$ and $D_S$ the sub-vector and sub-matrix of $\alpha$ and $D$ constructed using the variables in $S$. The result (4.3) also indicates that the independence rule works no better than random guessing due to noise accumulation, unless the signal levels are extremely high, say, $\sqrt{n/p}\,C_p \ge B$ for some $B > 0$. Hall, Pittelkow and Ghosh (2008) show that if $C_p^2/p \to \infty$, the classification error goes to zero for a distance-based classifier, which is a specific result of Fan and Fan (2008) with $B = \infty$.

The above results reveal that dimensionality reduction is also very important for reducing the misclassification rate. A popular class of dimensionality reduction techniques is projection. See, for example, principal component analysis in Ghosh (2002) and Zou, Hastie and Tibshirani (2004); partial least squares in Huang and Pan (2003) and Boulesteix (2004); and sliced inverse regression in Chiaromonte and Martinelli (2002), Antoniadis, Lambert-Lacroix and Leblanc (2003), and Bura and Pfeiffer (2003). These projection methods attempt to find directions that can result in small classification errors. In fact, the directions that they find usually put much larger weights on features with large classification power, which is indeed a type of sparsity in the projection vector. Fan and Fan (2008) formally show that linear projection methods are likely to perform poorly unless the projection vector is sparse, namely, the effective number of selected features is small. This is due to the aforementioned noise accumulation when estimating $\mu_1$ and $\mu_2$ in high dimensional problems. For formal results, see Theorem 2 in Fan and Fan (2008). See also Tibshirani, Hastie, Narasimhan and Chu (2002), Donoho and Jin (2008), Hall, Park and Samworth (2008), Hall, Pittelkow and Ghosh (2008), Hall and Chan (2009), Hall and Miller (2009b), and Jin (2009) for some recent developments in high dimensional classification.

To select important features, the two-sample $t$ test is frequently employed (see, e.g., Tibshirani et al. (2003)). The two-sample $t$ statistic for feature $j$ is
$$T_j = \frac{\bar{X}_{1j} - \bar{X}_{2j}}{\sqrt{S_{1j}^2/n_1 + S_{2j}^2/n_2}}, \quad j = 1, \ldots, p, \qquad (4.4)$$
where $\bar{X}_{kj}$ and $S_{kj}^2$ are the sample mean and variance of the $j$-th feature in class $k$. This is a specific example of independence learning, which ranks the features according to $|T_j|$. Fan and Fan (2008) prove that when the dimensionality $p$ grows no faster than exponentially in the sample size, if the lowest signal level is not too small, the two-sample $t$ test can select all important features with probability tending to 1. Their proof relies on deviation results for the two-sample $t$-statistic. See, e.g., Hall (1987, 2006), Jing, Shao and Wang (2003), and Cao (2007) for large deviation theory.
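A small sketch of ours of the feature ranking in (4.4); the function names are hypothetical.

```python
import numpy as np

def two_sample_t(X1, X2):
    """Two-sample t statistics (4.4), one per feature (column).

    X1 has shape (n1, p) and X2 has shape (n2, p); S_kj^2 are the usual
    unbiased sample variances within each class.
    """
    n1, n2 = X1.shape[0], X2.shape[0]
    mean_diff = X1.mean(axis=0) - X2.mean(axis=0)
    var1 = X1.var(axis=0, ddof=1)
    var2 = X2.var(axis=0, ddof=1)
    return mean_diff / np.sqrt(var1 / n1 + var2 / n2)

def rank_features_by_t(X1, X2, m):
    """Indices of the m features with the largest absolute t statistics."""
    t = two_sample_t(X1, X2)
    return np.argsort(-np.abs(t))[:m]
```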

Although the $t$ test can correctly select all important features with probability tending to 1 under some regularity conditions, the resulting choice is not necessarily optimal, since the noise accumulation can exceed the signal accumulation for faint features. Therefore, it is necessary to further single out the most important features. To address this issue, Fan and Fan (2008) propose the Features Annealed Independence Rule (FAIR). Instead of constructing the independence rule using all features, FAIR selects the most important ones and uses them to construct an independence rule. To appreciate the idea of FAIR, first note that the relative importance of features can be measured by $|\alpha_j|/\sigma_j$, where $\alpha_j$ is the $j$-th component of $\boldsymbol{\alpha} = \mu_1 - \mu_2$ and $\sigma_j^2$ is the common variance of the $j$-th feature. If such oracle ranking information were available, then one could construct the independence rule using the $m$ features with the largest $|\alpha_j|/\sigma_j$, with the optimal value of $m$ to be determined. In this case, the oracle classifier takes the form

$$\hat{\delta}(\mathbf{x}) = \sum_{j=1}^{p} \hat{\alpha}_j (x_j - \hat{\mu}_j)\hat{\sigma}_j^{-2}\, 1_{\{|\alpha_j|/\sigma_j > b\}},$$

where $b$ is a positive constant. It is easy to see that choosing the optimal $m$ is equivalent to selecting the optimal $b$. However, oracle information is usually unavailable, and one needs to learn it from the data. Observe that $|\alpha_j|/\sigma_j$ can be estimated by $|\hat{\alpha}_j|/\hat{\sigma}_j$, where the latter is in fact $\sqrt{n/(n_1 n_2)}\,|T_j|$ when the pooled sample variance is used. This is indeed the same as ranking the features by the correlation between the $j$th variable and the class response $\pm 1$ when $n_1 = n_2$ (Fan and Lv (2008)). Indeed, as pointed out by Hall, Titterington and Xue (2008), this is always true if the response for the first class is assigned as 1, whereas the response for the second class is assigned as $-n_1/n_2$. Thus, to mimic


the oracle, FAIR takes a slightly different form to adapt to the unknown signal strength:

$$\hat{\delta}_{\mathrm{FAIR}}(\mathbf{x}) = \sum_{j=1}^{p} \hat{\alpha}_j (x_j - \hat{\mu}_j)\hat{\sigma}_j^{-2}\, 1_{\{\sqrt{n/(n_1 n_2)}\,|T_j| > b\}}. \qquad (4.5)$$

It is clear from (4.5) that FAIR works the same way as if we first sort the features by the absolute values of their $t$-statistics in descending order, and then take out the first $m$ features to construct a classifier. The number of features is selected by minimizing the upper bound of the classification error:

$$\hat{m} = \arg\max_{1 \le m \le p} \frac{1}{\hat{\lambda}^{m}_{\max}} \, \frac{n\big[\sum_{j=1}^{m} T^2_{(j)} + m(n_1 - n_2)/n\big]^2}{m n_1 n_2 + n_1 n_2 \sum_{j=1}^{m} T^2_{(j)}},$$

where $T^2_{(1)} \ge T^2_{(2)} \ge \cdots \ge T^2_{(p)}$ are the ordered squared $t$-statistics, and $\hat{\lambda}^{m}_{\max}$ is an estimate of the largest eigenvalue of the correlation matrix $\mathbf{R}_m$ of the $m$ most significant features. Fan and Fan (2008) also derive the misclassification rates of FAIR and demonstrate that it possesses an oracle property.
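A minimal sketch of the FAIR feature-count selection, assuming the $t$-statistics from (4.4) are already available and using the largest eigenvalue of the sample correlation matrix of the top-$m$ features for $\hat{\lambda}^{m}_{\max}$; the function and variable names are ours, not from Fan and Fan (2008).

```python
import numpy as np

def fair_num_features(T, X, n1, n2, m_max=None):
    """Pick m maximizing the FAIR criterion; T: t-statistics, X: pooled (n, p) data."""
    n, p = X.shape
    order = np.argsort(-np.abs(T))             # features sorted by |T_j|, descending
    T2 = T[order] ** 2
    m_max = m_max or min(p, n - 2)
    best_m, best_val = 1, -np.inf
    for m in range(1, m_max + 1):
        top = order[:m]
        R = np.corrcoef(X[:, top], rowvar=False)          # correlation matrix of top-m features
        lam_max = np.linalg.eigvalsh(np.atleast_2d(R)).max()
        num = n * (T2[:m].sum() + m * (n1 - n2) / n) ** 2
        den = m * n1 * n2 + n1 * n2 * T2[:m].sum()
        val = num / (lam_max * den)
        if val > best_val:
            best_m, best_val = m, val
    return best_m, order[:best_m]
```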

    4.3. Sure independence screening for generalized linear models

Correlation learning cannot be directly applied to the case of discrete covariates, such as in genetic studies with different genotypes. The mathematical results and technical arguments in Fan and Lv (2008) rely heavily on the joint normality assumptions. The natural question is how to screen variables in a more general context, and whether the sure screening property continues to hold with a limited false positive rate.

    Consider the generalized linear model (GLIM) with canonical link. That is,the conditional density is given by

$$f(y \mid \mathbf{x}) = \exp\big\{ y\,\theta(\mathbf{x}) - b(\theta(\mathbf{x})) + c(y) \big\}, \qquad (4.6)$$

for some known functions $b(\cdot)$, $c(\cdot)$, and $\theta(\mathbf{x}) = \mathbf{x}^T\boldsymbol{\beta}$. As we consider only variable selection on the mean regression function, we assume without loss of generality that the dispersion parameter $\phi = 1$. As before, we assume that each variable has been standardized to have mean 0 and variance 1.

    For GLIM (4.6), the penalized likelihood (3.1) is

$$n^{-1}\sum_{i=1}^{n} \ell(\mathbf{x}_i^T\boldsymbol{\beta}, y_i) + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (4.7)$$


where $\ell(\theta; y) = b(\theta) - \theta y$. The maximum marginal likelihood estimator (MMLE) $\hat{\beta}^M_j$ is defined as the minimizer of the componentwise regression

$$\hat{\boldsymbol{\beta}}^M_j = \big(\hat{\beta}^M_{j,0}, \hat{\beta}^M_j\big) = \arg\min_{\beta_0, \beta_j} \sum_{i=1}^{n} \ell(\beta_0 + \beta_j X_{ij}, Y_i), \qquad (4.8)$$

where $X_{ij}$ is the $i$th observation of the $j$th variable. This can be easily computed and its implementation is robust, avoiding numerical instability in ultra-high dimensional problems. The marginal estimator estimates the wrong object of course, but its magnitude provides useful information for variable screening. Fan and Song (2009) select the set of variables whose marginal magnitudes exceed a predefined threshold value $\gamma_n$:

$$\widehat{\mathcal{M}}_{\gamma_n} = \big\{ 1 \le j \le p : |\hat{\beta}^M_j| \ge \gamma_n \big\}. \qquad (4.9)$$

This is equivalent to ranking the features according to the magnitudes of the MMLEs $\{|\hat{\beta}^M_j|\}$. To understand the utility of the MMLE, we take the population version of the minimizer of the componentwise regression to be

$$\boldsymbol{\beta}^M_j = \big(\beta^M_{j,0}, \beta^M_j\big)^T = \arg\min_{\beta_0, \beta_j} E\,\ell(\beta_0 + \beta_j X_j, Y).$$

Fan and Song (2009) show that $\beta^M_j = 0$ if and only if $\mathrm{Cov}(X_j, Y) = 0$, and, under some additional conditions, that if $|\mathrm{Cov}(X_j, Y)| \ge c_1 n^{-\kappa}$ for $j \in \mathcal{M}_\star$, for given positive constants $c_1$ and $\kappa$, then there exists a constant $c_2$ such that

$$\min_{j \in \mathcal{M}_\star} |\beta^M_j| \ge c_2 n^{-\kappa}. \qquad (4.10)$$

In words, as long as $X_j$ and $Y$ are somewhat marginally correlated, with $\kappa < 1/2$, the marginal signal $\beta^M_j$ is detectable. They prove further the sure screening property

$$P\big(\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\gamma_n}\big) \to 1 \qquad (4.11)$$

(the convergence is exponentially fast) if $\gamma_n = c_3 n^{-\kappa}$ with a sufficiently small $c_3$, and that only the size of the non-sparse elements (not the dimensionality) matters for the sure screening property. For the Gaussian linear model (3.2) with sub-Gaussian covariate tails, the dimensionality can be as high as $\log p = o(n^{(1-2\kappa)/4})$, a weaker result than that in Fan and Lv (2008) in terms of the condition on $p$, but a stronger result in terms of the conditions on the covariates. For logistic regression with bounded covariates, such as genotypes, the dimensionality can be as high as $\log p = o(n^{1-2\kappa})$.
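A rough sketch of MMLE screening (4.8)-(4.9) for logistic regression, fitting one essentially unpenalized univariate fit per feature with scikit-learn and thresholding the marginal slopes; the threshold and names are illustrative, not from Fan and Song (2009).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def mmle_screen(X, y, gamma):
    """Fit a univariate logistic regression per column and threshold |beta_j^M|."""
    n, p = X.shape
    beta_m = np.empty(p)
    for j in range(p):
        # large C makes the L2 penalty negligible, so the fit is essentially unpenalized
        fit = LogisticRegression(C=1e6, solver="lbfgs")
        fit.fit(X[:, [j]], y)
        beta_m[j] = fit.coef_[0, 0]
    selected = np.flatnonzero(np.abs(beta_m) >= gamma)
    return selected, beta_m
```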


The sure screening property (4.11) is only part of the story. For example, if $\gamma_n = 0$ then all variables are selected and (4.11) holds trivially. The question is how large the selected model in (4.9) with $\gamma_n = c_3 n^{-\kappa}$ is. Under some regularity conditions, Fan and Song (2009) show that, with probability tending to one exponentially fast,

$$|\widehat{\mathcal{M}}_{\gamma_n}| = O\big\{ n^{2\kappa} \lambda_{\max}(\Sigma) \big\}. \qquad (4.12)$$

In words, the size of the selected model depends on how large the thresholding parameter $\gamma_n$ is and on how correlated the features are. It is of order $O(n^{2\kappa + \tau})$ if $\lambda_{\max}(\Sigma) = O(n^{\tau})$. This is the same as, or a somewhat stronger result than, that in Fan and Lv (2008) in terms of the selected model size, but it holds for a much more general class of models. In particular, there are no restrictions on $\tau$ and $\kappa$, or more generally on $\lambda_{\max}(\Sigma)$.

Fan and Song (2009) also study feature screening by using the marginal likelihood ratio test. Let $\hat{L}_0 = \min_{\beta_0} n^{-1}\sum_{i=1}^{n}\ell(\beta_0, Y_i)$ and

$$\hat{L}_j = \hat{L}_0 - \min_{\beta_0, \beta_j} n^{-1}\sum_{i=1}^{n} \ell(\beta_0 + \beta_j X_{ij}, Y_i). \qquad (4.13)$$

Rank the features according to the marginal utility $\{\hat{L}_j\}$, and select the set of variables

$$\widehat{\mathcal{N}}_{\nu_n} = \big\{ 1 \le j \le p_n : \hat{L}_j \ge \nu_n \big\}, \qquad (4.14)$$

where $\nu_n$ is a predefined threshold value. Let $L^\star_j$ be the population counterpart of $\hat{L}_j$. Then the minimum signal $\min_{j \in \mathcal{M}_\star} L^\star_j$ is of order $O(n^{-2\kappa})$, whereas the individual noise $\hat{L}_j - L^\star_j = O_P(n^{-1/2})$. In words, when $\kappa \ge 1/4$, the noise level is larger than the signal. This is the key technical challenge. By using the fact that the ranking is invariant to monotonic transformations, Fan and Song (2009) are able to show that, with $\nu_n = c_4 n^{-2\kappa}$ for a sufficiently small $c_4 > 0$,

$$P\Big\{ \mathcal{M}_\star \subset \widehat{\mathcal{N}}_{\nu_n}, \; |\widehat{\mathcal{N}}_{\nu_n}| \le O\big(n^{2\kappa}\lambda_{\max}(\Sigma)\big) \Big\} \to 1.$$

    Thus the sure screening property holds with a limited size of the selected model.
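Similarly, the marginal utility (4.13) can be approximated as the drop in the average negative log-likelihood when feature $j$ is added to the intercept-only model; a brief sketch for logistic regression (our own illustration, with $y$ coded 0/1).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def marginal_lr_utilities(X, y):
    """L_hat_j from (4.13): null mean log-loss minus marginal mean log-loss."""
    n, p = X.shape
    p1 = y.mean()                                  # intercept-only fitted probability
    L0 = log_loss(y, np.full(n, p1))               # average negative log-likelihood of the null model
    L = np.empty(p)
    for j in range(p):
        fit = LogisticRegression(C=1e6, solver="lbfgs").fit(X[:, [j]], y)
        L[j] = L0 - log_loss(y, fit.predict_proba(X[:, [j]])[:, 1])
    return L

# screening set (4.14): features whose utility exceeds a threshold nu
# L = marginal_lr_utilities(X, y); N_hat = np.flatnonzero(L >= nu)
```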

    4.4. Reduction of false positive rate

A screening method is usually a crude approach that results in many false positive variables. A simple idea for reducing the false positive rate is to apply a resampling technique, as proposed by Fan, Samworth and Wu (2009). Split the sample randomly into two halves and let $\hat{\mathcal{A}}_1$ and $\hat{\mathcal{A}}_2$ be the sets of active variables selected based on, respectively, the first half and the second half of the sample. If $\hat{\mathcal{A}}_1$ and $\hat{\mathcal{A}}_2$ both have a sure screening property, so does the set $\hat{\mathcal{A}} = \hat{\mathcal{A}}_1 \cap \hat{\mathcal{A}}_2$. On the other hand, $\hat{\mathcal{A}}$ has many fewer falsely selected variables, as an unimportant variable has to be selected twice at random in the ultra-high dimensional space, which is very unlikely. Therefore, $\hat{\mathcal{A}}$ reduces the number of false positive variables.
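A minimal sketch of this sample-splitting idea, assuming a generic screening routine (here a toy SIS ranking by absolute marginal correlation) that returns the indices of $d$ retained variables.

```python
import numpy as np

def screen(X, y, d):
    """Toy SIS screener: keep the d features with largest |marginal correlation|."""
    corr = (X - X.mean(0)).T @ (y - y.mean()) / X.shape[0]
    return set(np.argsort(-np.abs(corr))[:d])

def split_and_intersect(X, y, d, seed=0):
    """Screen each random half separately and intersect the two active sets."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    half1, half2 = idx[: n // 2], idx[n // 2:]
    A1 = screen(X[half1], y[half1], d)
    A2 = screen(X[half2], y[half2], d)
    return A1 & A2          # variables surviving both halves
```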

Write $\mathcal{A}$ for the set of active indices, that is, the set containing those indices $j$ for which $\beta_j \ne 0$ in the true model. Let $d$ be the size of the selected sets $\hat{\mathcal{A}}_1$ and $\hat{\mathcal{A}}_2$. Under some exchangeability conditions, Fan, Samworth and Wu (2009) demonstrate that

$$P\big(|\hat{\mathcal{A}} \cap \mathcal{A}^c| \ge r\big) \;\le\; \binom{d}{r}^2 \Big/ \binom{p - |\mathcal{A}|}{r} \;\le\; \frac{1}{r!}\Big(\frac{n^2}{p - |\mathcal{A}|}\Big)^r, \qquad (4.15)$$

where, for the second inequality, we require that $d \le n \le (p - |\mathcal{A}|)^{1/2}$. In other words, the probability of selecting at least $r$ inactive variables is very small when $n$ is small compared to $p$, such as in the situations discussed in the previous two sections.

    4.5. Iterative sure independence screening

SIS uses only the marginal information of the covariates, and its sure screening property can fail when the technical conditions are not satisfied. Fan and Lv (2008) point out three potential problems with SIS.

(a) (False Negative) An important predictor that is marginally uncorrelated but jointly correlated with the response cannot be picked by SIS. An example of this has the covariate vector $\mathbf{x}$ jointly normal with equi-correlation $\rho$, while $Y$ depends on the covariates through

$$\mathbf{x}^T\boldsymbol{\beta}^\star = \beta X_1 + \cdots + \beta X_J - J\rho\beta X_{J+1}.$$

Clearly, $X_{J+1}$ is independent of $\mathbf{x}^T\boldsymbol{\beta}^\star$ and hence of $Y$, yet the regression coefficient $J\rho\beta$ can be much larger in magnitude than $\beta$, the coefficient of the other variables. Such a hidden signature variable cannot be picked by using independence learning, but it has dominant predictive power on $Y$.

(b) (False Positive) Unimportant predictors that are highly correlated with the important predictors can have higher priority to be selected by SIS than important predictors that are relatively weakly related to the response. An illustrative example has

$$Y = \beta X_0 + \beta X_1 + \cdots + \beta X_J + \varepsilon,$$

where $X_0$ is independent of the other variables, which have a common correlation $\rho$. Then $\mathrm{Cov}(X_j, Y) = J\rho\beta = J\rho\,\mathrm{Cov}(X_0, Y)$ for $j = J+1, \ldots, p$, and $X_0$ has the lowest priority to be selected.


(c) The issue of collinearity among the predictors adds difficulty to the problem of variable selection.

Translating (a) to microarray data analysis, a two-sample test can never pick up a hidden signature gene. Yet missing the hidden signature gene can result in very poor understanding of the molecular mechanism and in poor disease classification. Fan and Lv (2008) address these issues by proposing an iterative SIS (ISIS) that extends SIS and uses the joint information of the covariates more fully. ISIS still maintains computational expediency.

Fan, Samworth and Wu (2009) extend and improve the idea of ISIS from the multiple regression model to the more general loss function (3.15); this includes, in addition to the log-likelihood, the hinge loss $L(x, y) = (1 - xy)_+$ and the exponential loss $L(x, y) = \exp(-xy)$ used in classification, in which $y$ takes values $\pm 1$, among others. The $\psi$-learning of Shen, Tseng, Zhang and Wong (2003) can also be cast in this framework. ISIS also allows variable deletion in the process of iteration. More generally, suppose that our objective is to find a sparse $\boldsymbol{\beta}$ to minimize

$$n^{-1}\sum_{i=1}^{n} L(Y_i, \mathbf{x}_i^T\boldsymbol{\beta}) + \sum_{j=1}^{p} p_\lambda(|\beta_j|).$$

    The algorithm goes as follows.

1. Apply an SIS such as (4.14) to pick a set $\mathcal{A}_1$ of indices of size $k_1$, and then employ a penalized (pseudo-)likelihood method (3.14) to select a subset $\mathcal{M}_1$ of these indices.

2. (Large-scale screening) Instead of computing residuals as in Fan and Lv (2008), compute

$$L^{(2)}_j = \min_{\beta_0, \boldsymbol{\beta}_{\mathcal{M}_1}, \beta_j} n^{-1}\sum_{i=1}^{n} L\big(Y_i, \beta_0 + \mathbf{x}_{i,\mathcal{M}_1}^T\boldsymbol{\beta}_{\mathcal{M}_1} + X_{ij}\beta_j\big), \qquad (4.16)$$

for $j \notin \mathcal{M}_1$, where $\mathbf{x}_{i,\mathcal{M}_1}$ is the sub-vector of $\mathbf{x}_i$ consisting of those elements in $\mathcal{M}_1$. This measures the additional contribution of variable $X_j$ in the presence of the variables $\mathbf{x}_{\mathcal{M}_1}$. Pick the $k_2$ variables with the smallest $\{L^{(2)}_j, j \notin \mathcal{M}_1\}$ and let $\mathcal{A}_2$ be the resulting set.

3. (Moderate-scale selection) Use penalized likelihood to obtain

$$\hat{\boldsymbol{\beta}}_2 = \arg\min_{\beta_0, \boldsymbol{\beta}_{\mathcal{M}_1}, \boldsymbol{\beta}_{\mathcal{A}_2}} n^{-1}\sum_{i=1}^{n} L\big(Y_i, \beta_0 + \mathbf{x}_{i,\mathcal{M}_1}^T\boldsymbol{\beta}_{\mathcal{M}_1} + \mathbf{x}_{i,\mathcal{A}_2}^T\boldsymbol{\beta}_{\mathcal{A}_2}\big) + \sum_{j \in \mathcal{M}_1 \cup \mathcal{A}_2} p_\lambda(|\beta_j|). \qquad (4.17)$$


This gives a new set of active indices $\mathcal{M}_2$, consisting of the nonvanishing elements of $\hat{\boldsymbol{\beta}}_2$. This step also deviates importantly from the approach in Fan and Lv (2008), even in the least squares case. It allows the procedure to delete variables from the previously selected set $\mathcal{M}_1$.

4. (Iteration) Iterate the above two steps until $d$ (a prescribed number) variables are recruited or $\mathcal{M}_\ell = \mathcal{M}_{\ell-1}$.

The final estimate is then $\hat{\boldsymbol{\beta}}_{\mathcal{M}_\ell}$. In implementation, Fan, Samworth and Wu (2009) choose $k_1 = \lfloor 2d/3 \rfloor$, and thereafter at the $r$-th iteration take $k_r = d - |\mathcal{M}_{r-1}|$. This ensures that the iterated versions of SIS take at least two iterations to terminate. The above method can be considered as an analogue of the least squares ISIS procedure (Fan and Lv (2008)) without explicit definition of the residuals. Fan and Lv (2008) and Fan, Samworth and Wu (2009) show empirically that ISIS significantly improves the performance of SIS, even in the difficult cases described above.
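A schematic of the ISIS loop above for the penalized least squares case; `conditional_utility` plays the role of (4.16), and the moderate-scale selection step (4.17) is approximated by an $L_1$ fit from scikit-learn rather than a folded-concave penalty, so this is only a sketch of the structure, not the procedure of Fan, Samworth and Wu (2009).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV

def conditional_utility(X, y, M, j):
    """Squared-loss version of (4.16): mean squared residual of y on X[:, M + [j]]."""
    cols = sorted(M) + [j]
    fit = LinearRegression().fit(X[:, cols], y)
    resid = y - fit.predict(X[:, cols])
    return np.mean(resid ** 2)

def penalized_fit(X, y, cols):
    """Sparse fit on the candidate columns; nonzero coefficients define the new active set."""
    cols = sorted(cols)
    lasso = LassoCV(cv=5).fit(X[:, cols], y)
    return [c for c, b in zip(cols, lasso.coef_) if abs(b) > 1e-8]

def isis(X, y, d):
    n, p = X.shape
    k = max(2 * d // 3, 1)
    # step 1: marginal screening, then a sparse fit on the screened set
    corr = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    A = list(np.argsort(-corr)[:k])
    M = penalized_fit(X, y, A)
    for _ in range(20):                      # step 4: iterate until d recruited or no change
        k = d - len(M)
        if k <= 0:
            break
        # step 2: rank remaining variables by conditional utility (smaller is better)
        rest = [j for j in range(p) if j not in M]
        util = np.array([conditional_utility(X, y, M, j) for j in rest])
        A2 = [rest[i] for i in np.argsort(util)[:k]]
        # step 3: penalized fit on M union A2, which may drop previously selected variables
        M_new = penalized_fit(X, y, set(M) | set(A2))
        if set(M_new) == set(M):
            break
        M = M_new
    return sorted(M)
```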

    5. Sampling Properties of Penalized Least Squares

The sampling properties of penalized likelihood estimation (3.1) have been extensively studied, and a significant amount of work has been devoted to penalized least squares (3.3). The theoretical studies can be mainly classified into four groups: persistence, consistency and selection consistency, the weak oracle property, and the oracle property (from weak to strong). Again, persistence means consistency of the risk (expected loss) of the estimated model, as opposed to consistency of the estimate of the parameter vector under some loss. Selection consistency means consistency of the selected model. By the weak oracle property, we mean that the estimator enjoys the same sparsity as the oracle estimator with asymptotic probability one, and is consistent. The oracle property is stronger than the weak oracle property in that, in addition to sparsity in the same sense and consistency, the estimator attains an information bound mimicking that of the oracle estimator. Results have revealed the behavior of different penalty functions and the impact of dimensionality on high dimensional variable selection.

    5.1. Dantzig selector and its asymptotic equivalence to LASSO

The $L_1$ regularization (e.g., the LASSO) has received much attention due to its convexity and its encouragement of sparse solutions. The idea of using the $L_1$ norm can be traced back to the introduction of convex relaxation for deconvolution in Claerbout and Muir (1973), Taylor, Banks and McCoy (1979), and Santosa and Symes (1986). The use of the $L_1$ penalty has been shown to have close


    connections to other methods. For example, sparse approximation using an L1approach is shown in Girosi (1998) to be equivalent to support vector machines(Vapnik (1995)) for noiseless data. Another example is the asymptotic equiva-lence between the Dantzig selector (Candes and Tao (2007)) and LASSO.

The $L_1$ regularization has also been used in the Dantzig selector recently proposed by Candes and Tao (2007), which is defined as the solution to

$$\min \|\boldsymbol{\beta}\|_1 \quad \text{subject to} \quad \big\| n^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \big\|_\infty \le \lambda, \qquad (5.1)$$

where $\lambda \ge 0$ is a regularization parameter. It was named after Dantzig because the convex optimization problem (5.1) can easily be recast as a linear program. Unlike the PLS (3.3), which uses the residual sum of squares as a measure of goodness of fit, the Dantzig selector uses the $L_\infty$ norm of the covariance vector $n^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$, i.e., the maximum absolute covariance between a covariate and the residual vector $\mathbf{y} - \mathbf{X}\boldsymbol{\beta}$, for controlling the model fitting. This $L_\infty$ constraint can be viewed as a relaxation of the normal equation

$$\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\boldsymbol{\beta}, \qquad (5.2)$$

namely, finding the estimator that has the smallest $L_1$-norm in a neighborhood of the least squares estimate. A prominent feature of the Dantzig selector is its nonasymptotic oracle inequalities under $L_2$ loss. Consider the Gaussian linear regression model (3.2) with $\boldsymbol{\varepsilon} \sim N(0, \sigma^2 I_n)$ for some $\sigma > 0$, and assume that each covariate is standardized to have $L_2$ norm $\sqrt{n}$ (note that we changed the scale of $\mathbf{X}$, since it was assumed that each covariate has unit $L_2$ norm in Candes and Tao (2007)). Under the uniform uncertainty principle (UUP) on the design matrix $\mathbf{X}$, a condition on the finite condition number of submatrices of $\mathbf{X}$, they show that, with high probability, the Dantzig selector $\hat{\boldsymbol{\beta}}$ mimics the risk of the oracle estimator up to a factor of $\log p$; specifically,

$$\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\|_2 \le C\Big\{ (2\log p)\Big[\frac{\sigma^2}{n} + \sum_{j \in \mathrm{supp}(\boldsymbol{\beta}_0)} \Big(\beta_{0,j}^2 \wedge \frac{\sigma^2}{n}\Big)\Big] \Big\}^{1/2}, \qquad (5.3)$$

where $\boldsymbol{\beta}_0 = (\beta_{0,1}, \ldots, \beta_{0,p})^T$ is the vector of the true regression coefficients, $C$ is some positive constant, and $\lambda$ is of order $\sigma\sqrt{(2\log p)/n}$. Roughly speaking, the UUP condition (see also Donoho and Stark (1989) and Donoho and Huo (2001)) requires that all $n \times d$ submatrices of $\mathbf{X}$ with $d$ comparable to $\|\boldsymbol{\beta}_0\|_0$ be uniformly close to orthonormal matrices, which can be stringent in high dimensions. See Fan and Lv (2008) and Cai and Lv (2007) for more discussion. The oracle inequality (5.3) does not, however, say much about the sparsity of the estimate.


Shortly after the work on the Dantzig selector, it was observed that the Dantzig selector and the LASSO share some similarities. Bickel, Ritov and Tsybakov (2008) present a theoretical comparison of the LASSO and the Dantzig selector in the general high dimensional nonparametric regression model. Under a sparsity scenario, Bickel et al. (2008) derive parallel oracle inequalities for the prediction risk of both methods, and establish the asymptotic equivalence of the LASSO estimator and the Dantzig selector. More specifically, consider the nonparametric regression model

$$\mathbf{y} = \mathbf{f} + \boldsymbol{\varepsilon}, \qquad (5.4)$$

where $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^T$ with $f$ an unknown $p$-variate function, and $\mathbf{y}$, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T$, and $\boldsymbol{\varepsilon}$ are the same as in (3.2). Let $\{f_1, \ldots, f_M\}$ be a finite dictionary of $p$-variate functions. As pointed out in Bickel et al. (2008), the $f_j$'s can be a collection of basis functions for approximating $f$, or estimators arising from $M$ different methods. For any $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_M)^T$, define $f_{\boldsymbol{\lambda}} = \sum_{j=1}^{M} \lambda_j f_j$. Then, similarly to (3.3) and (5.1), the LASSO estimator $\hat{f}_L$ and the Dantzig selector $\hat{f}_D$ can be defined accordingly as $f_{\hat{\boldsymbol{\lambda}}_L}$ and $f_{\hat{\boldsymbol{\lambda}}_D}$, with $\hat{\boldsymbol{\lambda}}_L$ and $\hat{\boldsymbol{\lambda}}_D$ the corresponding $M$-vectors of minimizers. In both formulations, the empirical norm $\|f_j\|_n = \sqrt{n^{-1}\sum_{i=1}^{n} f_j^2(\mathbf{x}_i)}$ of $f_j$ is incorporated as its scale. Bickel et al. (2008) show that, under the restricted eigenvalue condition on the Gram matrix and some other regularity conditions, with significant probability the difference between $\|\hat{f}_D - f\|_n^2$ and $\|\hat{f}_L - f\|_n^2$ is bounded by a product of three factors. The first factor, $s\sigma^2/n$, corresponds to the prediction error rate in regression with $s$ parameters, and the other two factors, including $\log M$, reflect the impact of a large number of regressors. They further prove sparsity oracle inequalities for the prediction loss of both estimators. These inequalities entail that the distances between the prediction losses of the Dantzig selector and the LASSO estimator are of the same order as the distances between them and their oracle approximations.

Bickel et al. (2008) also consider the specific case of the linear model (3.2), that is, (5.4) with true regression function $\mathbf{f} = \mathbf{X}\boldsymbol{\beta}_0$. If $\boldsymbol{\varepsilon} \sim N(0, \sigma^2 I_n)$ and some regularity conditions hold, they show that, with large probability, the $L_q$ estimation loss for $1 \le q \le 2$ of the Dantzig selector $\hat{\boldsymbol{\beta}}_D$ is simultaneously given by

$$\|\hat{\boldsymbol{\beta}}_D - \boldsymbol{\beta}_0\|_q^q \le C\Big(1 + \sqrt{\frac{s}{m}}\Big)^{2(q-1)} s\Big(\frac{\log p}{n}\Big)^{q/2}, \qquad (5.5)$$

where $s = \|\boldsymbol{\beta}_0\|_0$, $m \ge s$ is associated with the strong restricted eigenvalue condition on the design matrix $\mathbf{X}$, and $C$ is some positive constant. When $q = 1$, they prove (5.5) under a (weak) restricted eigenvalue condition that does not


involve $m$. Bickel et al. (2008) also derive inequalities similar to (5.5), with slightly different constants, on the $L_q$ estimation loss, $1 \le q \le 2$, of the LASSO estimator $\hat{\boldsymbol{\beta}}_L$. These results demonstrate the approximate equivalence of the Dantzig selector and the LASSO. The similarity between the Dantzig selector and the LASSO has also been discussed in Efron, Hastie and Tibshirani (2007). Lounici (2008) derives the $L_\infty$ convergence rate and studies a sign concentration property simultaneously for the LASSO estimator and the Dantzig selector under a mutual coherence condition.

Note that the covariance vector $n^{-1}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ in the formulation of the Dantzig selector (5.1) is exactly the negative gradient of $(2n)^{-1}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$ in the PLS (3.3). This in fact entails that the Dantzig selector and the LASSO estimator are identical under some suitable conditions, provided that the same regularization parameter $\lambda$ is used in both methods. For example, Meinshausen, Rocha and Yu (2007) give a diagonal dominance condition on the $p \times p$ matrix $(\mathbf{X}^T\mathbf{X})^{-1}$ that ensures their equivalence. This condition implicitly assumes $p \le n$. James, Radchenko and Lv (2009) present a formal necessary and sufficient condition, as well as easily verifiable sufficient conditions, ensuring the identical solution of the Dantzig selector and the LASSO estimator when the dimensionality $p$ can exceed the sample size $n$.

    5.2. Model selection consistency of LASSO

There is a huge literature devoted to studying the statistical properties of the LASSO and related methods. This $L_1$ method, as well as its variants, has also been extensively studied in other areas such as compressed sensing. For example, Greenshtein and Ritov (2004) show that under some regularity conditions the LASSO-type procedures are persistent under quadratic loss for dimensionality of polynomial growth, and Greenshtein (2006) extends the results to more general loss functions. Meinshausen (2007) presents similar results for the LASSO for dimensionality of exponential growth and finite nonsparsity size, but its persistency rate is slower than that of a relaxed LASSO. For consistency and selection consistency results see Donoho, Elad and Temlyakov (2006), Meinshausen and Buhlmann (2006), Wainwright (2006), Zhao and Yu (2006), Bunea, Tsybakov and Wegkamp (2007), Bickel et al. (2008), van de Geer (2008), and Zhang and Huang (2008), among others.

As mentioned in the previous section, consistency results for the LASSO hold under some conditions on the design matrix. For the purpose of variable selection, we are also concerned with the sparsity of the estimator, particularly its model selection consistency, meaning that the estimator $\hat{\boldsymbol{\beta}}$ has the same support as the true regression coefficient vector $\boldsymbol{\beta}_0$ with asymptotic probability one. Zhao and Yu (2006) characterize the model selection consistency of the LASSO by


studying a stronger, but technically more convenient, property of sign consistency: $P(\mathrm{sgn}(\hat{\boldsymbol{\beta}}) = \mathrm{sgn}(\boldsymbol{\beta}_0)) \to 1$ as $n \to \infty$. They show that the weak irrepresentable condition

$$\big\| \mathbf{X}_2^T\mathbf{X}_1(\mathbf{X}_1^T\mathbf{X}_1)^{-1}\mathrm{sgn}(\boldsymbol{\beta}_1) \big\|_\infty < 1 \qquad (5.6)$$

is necessary for sign consistency of the LASSO, and that the strong irrepresentable condition, which requires that the left-hand side of (5.6) be uniformly bounded by a positive constant $C < 1$, is sufficient for sign consistency of the LASSO, where $\boldsymbol{\beta}_1$ is the subvector of $\boldsymbol{\beta}_0$ on its support $\mathrm{supp}(\boldsymbol{\beta}_0)$, and $\mathbf{X}_1$ and $\mathbf{X}_2$ denote the submatrices of the $n \times p$ design matrix $\mathbf{X}$ formed by the columns in $\mathrm{supp}(\boldsymbol{\beta}_0)$ and its complement, respectively. See also Zou (2006) for the fixed $p$ case. However, the irrepresentable condition can become restrictive in high dimensions; see Section 5.4 for a simple illustrative example, because the same condition shows up in a related problem of sparse recovery using $L_1$ regularization. This demonstrates that in high dimensions the LASSO estimator can easily select an inconsistent model, which explains why the LASSO tends to include many false positive variables in the selected model.
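The left-hand side of (5.6) is straightforward to evaluate for a given design; the following small helper (hypothetical names, our own illustration) checks the strong irrepresentable condition for a candidate support and sign pattern.

```python
import numpy as np

def irrepresentable_lhs(X, support, signs):
    """Compute ||X2^T X1 (X1^T X1)^{-1} sgn(beta_1)||_inf from (5.6)."""
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[support] = True
    X1, X2 = X[:, mask], X[:, ~mask]
    w = np.linalg.solve(X1.T @ X1, np.asarray(signs, dtype=float))
    return np.max(np.abs(X2.T @ X1 @ w))

# the strong irrepresentable condition holds with constant C if the value is <= C < 1, e.g.
# irrepresentable_lhs(X, support=[0, 1, 2], signs=[1, -1, 1]) < 1
```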

To establish the weak oracle property of the LASSO, in addition to the sparsity characterized above, we need its consistency. To this end, we usually need the condition on the design matrix that

$$\big\| \mathbf{X}_2^T\mathbf{X}_1(\mathbf{X}_1^T\mathbf{X}_1)^{-1} \big\|_\infty \le C \qquad (5.7)$$

for some positive constant $C < 1$, which is stronger than the strong irrepresentable condition. It says that the $L_1$-norm of the regression coefficients of each inactive variable regressed on the $s$ active variables must be uniformly bounded by $C < 1$. This shows that the capacity of the LASSO for selecting a consistent model is very limited, noticing also that the $L_1$-norm of such regression coefficients typically increases with $s$. See, e.g., Wainwright (2006). As discussed above, condition (5.7) is a stringent condition in high dimensions for the LASSO estimator to enjoy the weak oracle property. The model selection consistency of the LASSO in the context of graphical models has been studied by Meinshausen and Buhlmann (2006), who consider Gaussian graphical models with polynomially growing numbers of nodes.

    5.3. Oracle property

What are the sampling properties of penalized least squares (3.3) and penalized likelihood estimation (3.1) when the penalty function $p_\lambda$ is no longer convex? The oracle property (Fan and Li (2001)) provides a nice conceptual framework for understanding the statistical properties of high dimensional variable selection methods.


In a seminal paper, Fan and Li (2001) build the theoretical foundation of nonconvex penalized least squares or, more generally, nonconcave penalized likelihood for variable selection. They introduce the oracle property for model selection. An estimator $\hat{\boldsymbol{\beta}}$ is said to have the oracle property if it enjoys sparsity, in the sense that $\hat{\boldsymbol{\beta}}_2 = 0$ with probability tending to 1 as $n \to \infty$, and $\hat{\boldsymbol{\beta}}_1$ attains an information bound mimicking that of the oracle estimator, where $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ are the subvectors of $\hat{\boldsymbol{\beta}}$ formed by the components in $\mathrm{supp}(\boldsymbol{\beta}_0)$ and $\mathrm{supp}(\boldsymbol{\beta}_0)^c$, respectively, while the oracle knows the true model $\mathrm{supp}(\boldsymbol{\beta}_0)$ beforehand. The oracle properties of penalized least squares estimators can be understood in the more general framework of penalized likelihood estimation. Fan and Li (2001) study the oracle properties of nonconcave penalized likelihood estimators in the finite-dimensional setting, and Fan and Peng (2004) extend their results to the moderate dimensional setting with $p = o(n^{1/5})$ or $o(n^{1/3})$.

More specifically, without loss of generality, assume that the true regression coefficient vector is $\boldsymbol{\beta}_0 = (\boldsymbol{\beta}_1^T, \boldsymbol{\beta}_2^T)^T$, with $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ the subvectors of nonsparse and sparse elements respectively: $\|\boldsymbol{\beta}_1\|_0 = \|\boldsymbol{\beta}_0\|_0$ and $\boldsymbol{\beta}_2 = 0$. Let $a_n = \|p_\lambda'(|\boldsymbol{\beta}_1|)\|_\infty$ and $b_n = \|p_\lambda''(|\boldsymbol{\beta}_1|)\|_\infty$. Fan and Li (2001) and Fan and Peng (2004) show that, as long as $a_n, b_n = o(1)$, under some regularity conditions there exists a local maximizer $\hat{\boldsymbol{\beta}}$ of the penalized likelihood (3.1) such that

$$\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\|_2 = O_P\big(\sqrt{p}\,(n^{-1/2} + a_n)\big). \qquad (5.8)$$

This entails that choosing the regularization parameter $\lambda$ with $a_n = O(n^{-1/2})$ gives a root-$(n/p)$ consistent penalized likelihood estimator. In particular, this is the case when the SCAD penalty is used if $\lambda = o(\min_{1 \le j \le s}|\beta_{0,j}|)$, where $\boldsymbol{\beta}_1 = (\beta_{0,1}, \ldots, \beta_{0,s})^T$. Recently, Fan and Lv (2009) gave a sufficient condition under which the solution is unique.

Fan and Li (2001) and Fan and Peng (2004) further prove the oracle properties of penalized likelihood estimators under some additional regularity conditions. Let $\Sigma_\lambda = \mathrm{diag}\{p_\lambda''(|\boldsymbol{\beta}_1|)\}$ and $\bar{p}_\lambda(\boldsymbol{\beta}_1) = \mathrm{sgn}(\boldsymbol{\beta}_1) \circ p_\lambda'(|\boldsymbol{\beta}_1|)$, where $\circ$ denotes the Hadamard (componentwise) product. Assume that $\lambda = o(\min_{1 \le j \le s}|\beta_{0,j}|)$, that $\sqrt{n/p}\,\lambda \to \infty$ as $n \to \infty$, and that the penalty function $p_\lambda$ satisfies $\liminf_{n \to \infty} \liminf_{t \to 0+} p_\lambda'(t)/\lambda > 0$. They show that if $p = o(n^{1/5})$, then with probability tending to 1 as $n \to \infty$, the root-$(n/p)$ consistent local maximizer $\hat{\boldsymbol{\beta}} = (\hat{\boldsymbol{\beta}}_1^T, \hat{\boldsymbol{\beta}}_2^T)^T$ satisfies the following:

(a) (Sparsity) $\hat{\boldsymbol{\beta}}_2 = 0$;

(b) (Asymptotic normality)

$$\sqrt{n}\,A_n I_1^{-1/2}(I_1 + \Sigma_\lambda)\big[\hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_1 + (I_1 + \Sigma_\lambda)^{-1}\bar{p}_\lambda(\boldsymbol{\beta}_1)\big] \overset{D}{\longrightarrow} N(0, G), \qquad (5.9)$$


where $A_n$ is a $q \times s$ matrix such that $A_n A_n^T \to G$, a $q \times q$ symmetric positive definite matrix, $I_1 = I(\boldsymbol{\beta}_1)$ is the Fisher information matrix knowing the true model $\mathrm{supp}(\boldsymbol{\beta}_0)$, and $\hat{\boldsymbol{\beta}}_1$ is the subvector of $\hat{\boldsymbol{\beta}}$ formed by the components in $\mathrm{supp}(\boldsymbol{\beta}_0)$.

Consider a few penalties. For the SCAD penalty, the condition $\lambda = o(\min|\boldsymbol{\beta}_1|)$ entails that both $\bar{p}_\lambda(\boldsymbol{\beta}_1)$ and $\Sigma_\lambda$ vanish asymptotically. Therefore, the asymptotic normality (5.9) becomes

$$\sqrt{n}\,A_n I_1^{-1/2}(\hat{\boldsymbol{\beta}}_1 - \boldsymbol{\beta}_1) \overset{D}{\longrightarrow} N(0, G), \qquad (5.10)$$

which shows that $\hat{\boldsymbol{\beta}}_1$ has the same asymptotic efficiency as the MLE of $\boldsymbol{\beta}_1$ knowing the true model in advance. This demonstrates that the resulting penalized likelihood estimator is as efficient as the oracle one. For the $L_1$ penalty (LASSO), the root-$(n/p)$ consistency of $\hat{\boldsymbol{\beta}}$ requires $\lambda = a_n = O(n^{-1/2})$, whereas the oracle property requires $\sqrt{n/p}\,\lambda \to \infty$ as $n \to \infty$. However, these two conditions are incompatible, which suggests that the LASSO estimator generally does not have the oracle property. This is intrinsically due to the fact that the $L_1$ penalty does not satisfy the unbiasedness condition.

It has indeed been shown in Zou (2006) that the LASSO estimator does not have the oracle property, even in the finite parameter setting. To address the bias issue of the LASSO, he proposes the adaptive LASSO, which uses an adaptively weighted $L_1$ penalty. More specifically, the weight vector is $|\hat{\boldsymbol{\beta}}|^{-\gamma}$ for some $\gamma > 0$, with the power understood componentwise, where $\hat{\boldsymbol{\beta}}$ is an initial root-$n$ consistent estimator of $\boldsymbol{\beta}_0$. Since $\hat{\boldsymbol{\beta}}$ is root-$n$ consistent, the constructed weights can separate important variables from unimportant ones. This is an attempt to introduce a SCAD-like penalty to reduce the bias. From (3.11), it can easily be seen that the adaptive LASSO is just a specific solution of penalized least squares using LLA. As a consequence, Zou (2006) shows that the adaptive LASSO has the oracle property under some regularity conditions. See also Zhang and Huang (2008).
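A minimal sketch of the adaptive LASSO, assuming $n > p$ so that ordinary least squares can serve as the initial root-$n$ consistent estimator; the weights are absorbed by rescaling the columns, which turns the weighted $L_1$ problem into a standard LASSO fit. This is a common computational device, not code from Zou (2006).

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def adaptive_lasso(X, y, gamma=1.0, eps=1e-8):
    """Weighted-L1 fit: penalize |beta_j| / |beta_init_j|^gamma via column rescaling."""
    beta_init = LinearRegression().fit(X, y).coef_   # initial estimator (requires n > p)
    w = np.abs(beta_init) ** gamma + eps             # larger initial |beta_j| => smaller penalty
    X_scaled = X * w                                 # substitute beta_j = w_j * theta_j
    theta = LassoCV(cv=5).fit(X_scaled, y).coef_
    return w * theta                                 # transform back to the original scale
```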

    5.4. Additional properties of SCAD estimator

In addition to the oracle properties outlined in the last section, and also in Section 6.2, Kim, Choi and Oh (2008) and Kim and Kwon (2009) provide insights into the SCAD estimator. They attempt to answer the questions of when the oracle estimator $\hat{\boldsymbol{\beta}}^o$ is a local minimizer of the penalized least squares with the SCAD penalty, when the SCAD estimator and the oracle estimator coincide, and how to check whether a local minimizer is a global minimizer. The first two results are indeed stronger than the oracle property, as they show that the SCAD estimator is the oracle estimator itself, rather than merely mimicking its performance.


Recall that all covariates have been standardized. The following assumption is needed.

Condition A. The nonsparsity size is $s_n = O(n^{c_1})$ for some $0 < c_1 < 1$, the minimum eigenvalue of the correlation matrix of the active variables is bounded away from zero, and the minimum signal $\min_{1 \le j \le s_n}|\beta_j| > c_3 n^{-(1-c_2)/2}$ for some constant $c_2 \in (c_1, 1]$.

Under Condition A, Kim et al. (2008) prove that if $E\varepsilon_i^{2k}$


the role of penalty functions in sparse recovery can give a simplified view of the role of penalty functions in high dimensional variable selection as the noise level approaches zero. In particular, we see that concave penalties are advantageous in sparse recovery, which is in line with the advocacy of folded-concave penalties for variable selection, as in Fan and Li (2001).

Consider the noiseless case $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}_0$ of the linear model (3.2). The problem of sparse recovery aims to find the sparsest possible solution,

$$\arg\min_{\boldsymbol{\beta}} \|\boldsymbol{\beta}\|_0 \quad \text{subject to} \quad \mathbf{y} = \mathbf{X}\boldsymbol{\beta}. \qquad (5.12)$$

The solution to $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}$ is not unique when the $n \times p$ matrix $\mathbf{X}$ has rank less than $p$, e.g., when $p > n$. See Donoho and Elad (2003) for a characterization of the identifiability of the minimum $L_0$ solution $\boldsymbol{\beta}_0$. Although by its nature the $L_0$ penalty is the target penalty for sparse recovery, its computational complexity makes it infeasible to implement in high dimensions. This has motivated the use of penalties that are computationally tractable relaxations or approximations of the $L_0$ penalty. In particular, the convex $L_1$ penalty provides a nice convex relaxation and has attracted much attention. For properties of various $L_1$ and related methods see, for example, the Basis Pursuit of Chen, Donoho and Saunders (1999), Donoho and Elad (2003), Donoho (2004), Fuchs (2004), Candes and Tao (2005, 2006), Donoho, Elad and Temlyakov (2006), Tropp (2006), Candes, Wakin and Boyd (2008), and Cai, Xu and Zhang (2009).
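The $L_1$ relaxation of (5.12) (basis pursuit) is itself a linear program with equality constraints; a brief sketch analogous to the Dantzig selector code above, again only an illustration.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||beta||_1 subject to X beta = y (noiseless L1 relaxation of (5.12))."""
    n, p = X.shape
    # beta = u - v with u, v >= 0; objective sum(u) + sum(v) = ||beta||_1
    cost = np.ones(2 * p)
    A_eq = np.hstack([X, -X])
    res = linprog(cost, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]
```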

More generally, we can replace the $L_0$ penalty in (5.12) by a penalty function $\rho(\cdot)$ and consider the $\rho$-regularization problem

$$\min_{\boldsymbol{\beta}} \sum_{j=1}^{p} \rho(|\beta_j|) \quad \text{subject to} \quad \mathbf{y} = \mathbf{X}\boldsymbol{\beta}. \qquad (5.13)$$

This constrained optimization problem is closely related to the PLS in (3.3). A great deal of research has been devoted to identifying conditions on $\mathbf{X}$ and $\boldsymbol{\beta}_0$ that ensure the $L_1/L_0$ equivalence, i.e., that the $L_1$-regularization (5.13) gives the same solution $\boldsymbol{\beta}_0$. For example, Donoho (2004) contains deep results and shows that the individual $L_1/L_0$ equivalence depends only on $\mathrm{supp}(\boldsymbol{\beta}_0)$ and the signs of $\boldsymbol{\beta}_0$ on its support. See also Donoho and Huo (2001) and Donoho (2006b). In a recent work, Lv and Fan (2009) present a sufficient condition that ensures the $\rho/L_0$ equivalence for concave penalties. They consider increasing and concave penalty functions $\rho(\cdot)$ with finite maximum concavity (curvature); the convex $L_1$ penalty falls at the boundary of this class of penalty functions. Under these regularity conditions, they show that $\boldsymbol{\beta}_0$ is a local minimizer of (5.13) if there exists some $\delta \in (0, \min_{j \le s}|\beta_{0,j}|)$ such that

$$\max_{\mathbf{u} \in U} \big\| \mathbf{X}_2^T\mathbf{X}_1(\mathbf{X}_1^T\mathbf{X}_1)^{-1}\mathbf{u} \big\|_\infty < \rho'(0+), \qquad (5.14)$$


where $U = \{\mathrm{sgn}(\boldsymbol{\beta}_1) \circ \rho'(|\mathbf{v}|) : \|\mathbf{v} - \boldsymbol{\beta}_1\|_\infty \le \delta\}$, the notation being that of the previous two sections.

When the $L_1$ penalty is used, $U$ contains the single point $\mathrm{sgn}(\boldsymbol{\beta}_1)$, with $\mathrm{sgn}(\boldsymbol{\beta}_1)$ understood componentwise. In this case, condition (5.14) becomes the weak irrepresentable condition (5.6). In fact, the $L_1/L_0$ equivalence holds provided that (5.6), weakened to a nonstrict inequality, is satisfied. However, this condition can become restrictive in high dimensions. To appreciate this, look at an example given in Lv and Fan (2009). Suppose that $\mathbf{X}_1 = (\mathbf{x}_1, \ldots, \mathbf{x}_s)$ is orthonormal, $\mathbf{y} = \sum_{j=1}^{s}\beta_{0,j}\mathbf{x}_j$ with