
Noname manuscript No. (will be inserted by the editor)

Consistency Bounds and Support Recovery of D-stationary Solutions of Sparse Sample Average Approximations⋆

Miju Ahn

the date of receipt and acceptance should be inserted later

Abstract This paper studies properties of the d(irectional)-stationary solutions of sparse sample average approximation (SAA) problems involving difference-of-convex (dc) sparsity functions under a deterministic setting. Such properties are investigated with respect to a vector which satisfies a verifiable assumption that relates the empirical SAA problem to the expectation minimization problem defined by an underlying data distribution. We derive bounds on the distance between the two vectors and on the difference of the model outcomes they generate. Furthermore, the inclusion relationships between their supports, the sets of nonzero-valued indices, are studied. We provide conditions under which the support of a d-stationary solution is contained within, and contains, the support of the vector of interest; the first kind of inclusion can be shown for any arbitrarily given set of indices. Some of the results presented herein are generalizations of the existing theory for the specialized problem of ℓ1-norm regularized least squares minimization for linear regression.

Keywords non-convex optimization, sparse learning, difference-of-convex program, directional stationarity

⋆ This work is derived and extended from the last chapter [1, Chapter 4] of the author's Ph.D. dissertation, which was written under the supervision of Jong-Shi Pang.

Department of Engineering Management, Information, and Systems, Southern Methodist University, Dallas, Texas, 75205, U.S.A. E-mail: [email protected]

1 Introduction

Statistical models are often trained through the method of sample average approximation (SAA) given loss and model functions, such that the trained model achieves minimal error on the available data points, or samples. Such a methodology serves as a practical treatment to fit the 'best' model while exploiting the restrictive information of limited data. Ideally, the ultimate goal of a statistical learning problem is to find a model which makes robust predictions for unobserved data points, i.e., the model yields minimal prediction error with respect to an underlying data-generating distribution. Obtaining the latter model involves solving an optimization formulation in which the expectation of the composite of the loss and the model function is minimized with respect to the data population. The connection between the SAA and the expectation minimization problems has been studied extensively in the past. A convergence result is given by the classical Law of Large Numbers, which states that, under some regularity conditions, an SAA of a given objective function converges to the expectation of the function as the sample size grows to infinity. In [25], convergence of the set of optimal solutions and of the objective values of the approximation problem has been shown under some conditions as the sample size increases.

With the recent flurry of activities using sparsity functions for various kinds of variable selection problems, the sparse sample average approximation has received much attention in the literature. Sparse learning involves solving SAAs formulated with sparsity functions, with the goal of training efficient and robust models by setting insignificant components of the model parameters to exactly zero. In a nutshell, most of the existing (univariate) sparsity functions surrogate the ℓ0-function, defined as

\[
| t |_0 \,\triangleq\, \begin{cases} 1 & \text{if } t \neq 0 \\ 0 & \text{if } t = 0 \end{cases}
\]

for a scalar variable t; such functionals serve as continuous replacements for the discrete function to avoid the computational intractability caused by solving discontinuous ℓ0-formulations. One of the early works, a sparse SAA of the least squares error function and the ℓ1-norm, ‖u‖1 ≜ ∑_{i=1}^n |u_i| for u ∈ R^n, called the Least Absolute Shrinkage and Selection Operator (LASSO) [26], pioneered the field of sparse learning in the mid 1990s. The original work has been followed by theoretical investigations for variations of the problem, including statistical properties [5,31,14] and exact recovery [6,7], and by computational studies [32,28,9]. More recently, there has been increasing evidence, provided by theory in the literature and by the results of empirical studies [19,30,27,12], that supports the use of non-convex sparsity functions over convex functionals. The migration to non-convex programming started in the early 2000s, when Fan and Li pointed out that the ℓ1-norm does not satisfy a set of desirable statistical properties identified for ℓ0-surrogates, and proposed a non-convex function named the Smoothly Clipped Absolute Deviation (SCAD) to formulate variable selection problems [11]. Independently, non-convex surrogates such as the Minimax Concave Penalty (MCP) [29], the logarithmic function [2], the capped (or truncated) ℓ1 [16], and the transformed ℓ1 [22,30] have been introduced for sparse learning formulations; see also [2] and the references therein. All of the above are univariate functions that are symmetric about the vertical axis, concave on R+ (thus sometimes referred to as folded-concave penalties), and differentiable except at the origin.
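As an elementary numerical illustration (not part of the original text), the following sketch evaluates the ℓ0 count, the ℓ1 norm, and the capped ℓ1 surrogate min(|t|/a, 1) on a small sparse vector; the vector and the parameter a = 2 are arbitrary choices.

```python
import numpy as np

# Compare the discrete ell_0 count with two continuous surrogates on a sparse vector.
w = np.array([0.0, 3.0, 0.0, -0.4, 0.0, 1.2])
a = 2.0                                        # capped ell_1 parameter (illustrative)

ell0 = np.count_nonzero(w)                     # sum_j |w_j|_0
ell1 = np.abs(w).sum()                         # sum_j |w_j|
capped = np.minimum(np.abs(w) / a, 1.0).sum()  # sum_j min(|w_j|/a, 1)

print(ell0, ell1, capped)                      # 3, 4.6, 1.8
```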

Involving the surrogates mentioned above, the connection between sparse SAAs and the expectation minimization problem has been explored. The global optima of convex formulations, and the stationary solutions of non-convex formulations, of sparse SAAs have been studied in comparison to the (unattainable) optimal solutions of the underlying problem. In [21], the authors studied an SAA formulation where a convex function is regularized by a norm, resulting in a convex program. Based on stochastic variational inequality theory, the authors of [20] related the LASSO problem to the corresponding expectation minimization problem. Statistical properties of the global optimum of convex sparse SAAs have been studied in [4,13] for deterministic and randomized matrices. For non-convex sparse learning, the authors of [17,18] provide statistical analyses of formulations including non-convex model and loss composites and sparsity surrogates. Thus far, most of the existing theory in the literature is developed based on the (possibly unique) global minimum of the underlying problem, so that such results remain only theoretical investigations. In general, addressing the empirical solution properties of non-convex formulations is challenging, as the concept of local or global optimality is no longer applicable to practically achieved outcomes. Besides, involving the mentioned non-convex surrogates in the learning program introduces non-differentiability in addition to non-convexity; thus, for such formulations, concepts of stationarity, instead of optimality, should be considered and properly selected according to the structure of the problem.

This paper is motivated by advances in sparse learning problems formulated with non-convex ℓ0-surrogates. Our theory is developed based on the concept of directional stationarity defined by the directional derivative. Following a recent work [2], which introduces a unified difference-of-convex (dc) formulation for sparse SAAs and investigates optimality and sparsity of the d(irectional)-stationary solutions, we study properties of the stationary solutions of the dc program involving univariate sparsity functions. Through the current work, we attempt to narrow the gap between the minimizer-based theory in the literature and the results of practical computations, and to develop theory with computational considerations that is readily applicable to empirical outcomes gained from practice. In this vein, we study the properties of the d-stationary solutions with respect to a vector that satisfies a verifiable condition [to be defined in Section 2.2] instead of the global optimizer of the expectation minimization problem. Some of the results to be presented are generalizations of the theorems provided in [13, Chapter 11] for LASSO; that problem is a special case of the sparse learning formulation to be introduced in Section 2.1. We provide discussion for each result to be presented, compare with the existing theory, and summarize the contributions of the current work. We refer to the appendix for a summary of the existing theorems given in the provided reference.

2 Sparse SAA Formulations

Given a set of data points and a statistical model m : R^d × R^d → R, we want to find a vector of model parameters w ∈ R^d which minimizes the model prediction error over the general data population. Assume that every sample (x, y), consisting of feature information x ∈ R^d and an outcome value y ∈ R, is generated by an unknown distribution D. Ideally, we want to solve

\[
\operatorname*{minimize}_{w \in \mathbb{R}^d} \; \mathbb{E}_{(x,y) \sim \mathcal{D}} \big[ \, \ell( m(w, x), y ) \, \big] \;\triangleq\; L(w) \tag{1}
\]

where ℓ : R × R → R is the loss function that controls the fitness of the model to the given data. This problem is only theoretical, since the expectation is taken over the hidden data distribution. In practice, we observe a finite number of samples that are believed to be generated from the distribution and wish to extract the best knowledge from this limited information. A statistical method that is designed to perform such a task is the sample average approximation, which solves an approximation of problem (1) exploiting the available samples. The approximation problem is defined as

\[
\operatorname*{minimize}_{w \in \mathbb{R}^d} \; \frac{1}{N} \sum_{s=1}^{N} \ell( m(w, x^s), y^s ) \;\triangleq\; L_N(w) \tag{2}
\]

where N is the number of data points. As solving problem (2) serves as a realistic alternative to solving problem (1), the relationship between the two problems has been studied through convergence of the objective functions, of their objective values, and of the sets of optimal solutions. For sparse learning problems, the approximation in (2) and the sparsity functions can be formulated together as optimization problems of the following forms: a loss minimization subject to sparsity constraints, a cardinality minimization subject to the loss not exceeding a prescribed quantity, and a Lagrangian formulation where both loss and sparsity are simultaneously minimized within the objective function. We concentrate on studying the Lagrangian formulation for the rest of this work.

2.1 The Lagrangian formulation

The bi-criteria Lagrangian formulation for sparse representations is defined by the sample average approximation term LN(w) and a sparsity function P : R^d → R,

\[
\operatorname*{minimize}_{w \in \mathbb{R}^d} \; L_N(w) + \lambda_N \, P(w), \tag{3}
\]

where the penalty parameter λN > 0 controls the level of model complexity, or the number of nonzero components, in the corresponding solution. Throughout this work, we assume that

(A0). The composite function ℓ(m(w, x), y) is convex and differentiable in w for any given data point (x, y) ∈ R^{d+1}, and the sparsity function P(w) is a difference-of-convex function of the form

\[
P(w) \,\triangleq\, \underbrace{\sum_{j=1}^{d} \underbrace{c_j \, | w_j |}_{g_j(w_j)}}_{g(w)} \;-\; \underbrace{\sum_{j=1}^{d} \underbrace{\max_{1 \le k \le J_j} h_{jk}(w_j)}_{h_j(w_j)}}_{h(w)} \qquad \text{for some integers } J_j > 0 \tag{4}
\]

and for all j = 1, …, d, where each c_j is a positive scalar and each h_{jk} is a convex and differentiable function.

Thus P(w) is the sum of univariate dc functions, denoted by p_j, such that P(w) = ∑_{j=1}^d p_j(w_j) = ∑_{j=1}^d [ g_j(w_j) − h_j(w_j) ]. With an aim to investigate univariate dc surrogates, the dc representation splits the embedded convexity and differentiability of such functions into the convex component g(w) and the concave component h(w), respectively. The maximum structure in the latter term handles functions with multiple non-differentiable points, such as the capped ℓ1. We note that all ℓ0-surrogates mentioned in the previous section can be written in the dc form (4); see Table 1 for the dc representations of the univariate surrogates.


SCAD:
\[
p^{\mathrm{SCAD}}_{a,\lambda>0}(t) \,\triangleq\, \lambda|t| \;-\; \begin{cases} 0 & \text{if } |t| \le \lambda \\[2pt] \dfrac{(|t|-\lambda)^2}{2(a-1)} & \text{if } \lambda \le |t| \le a\lambda \\[4pt] \lambda|t| - \dfrac{(a+1)\lambda^2}{2} & \text{if } |t| \ge a\lambda \end{cases}
\]

MCP:
\[
p^{\mathrm{MCP}}_{a,\lambda>0}(t) \,\triangleq\, 2\lambda|t| \;-\; \begin{cases} \dfrac{t^2}{a} & \text{if } |t| \le a\lambda \\[4pt] 2\lambda|t| - a\lambda^2 & \text{if } |t| \ge a\lambda \end{cases}
\]

Capped ℓ1:
\[
p^{\mathrm{CL1}}_{a>0}(t) \,\triangleq\, \frac{|t|}{a} \;-\; \max\Big( 0,\; \frac{t}{a}-1,\; -\frac{t}{a}-1 \Big)
\]

Transformed ℓ1:
\[
p^{\mathrm{TL1}}_{a>0}(t) \,\triangleq\, \frac{a+1}{a}\,|t| \;-\; \Big( \frac{a+1}{a}\,|t| - \frac{(a+1)\,|t|}{a+|t|} \Big)
\]

Logarithmic:
\[
p^{\mathrm{Log}}_{\sigma',\lambda>0}(t) \,\triangleq\, \frac{\lambda}{\sigma'}\,|t| \;-\; \lambda\Big( \frac{|t|}{\sigma'} - \log(|t|+\sigma') + \log\sigma' \Big)
\]

Table 1: Difference-of-convex representations for univariate sparsity functions (t ∈ R)
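To make the decompositions of Table 1 concrete, here is a minimal numerical sketch (not from the manuscript) that evaluates g and h for SCAD and the capped ℓ1 and recovers the usual closed forms of the penalties; the parameter values a and λ are arbitrary illustrative choices.

```python
import numpy as np

# dc decompositions p(t) = g(t) - h(t) from Table 1, for SCAD and the capped ell_1.
def scad_g(t, a=3.7, lam=1.0):
    return lam * np.abs(t)

def scad_h(t, a=3.7, lam=1.0):
    u = np.abs(t)
    return np.where(u <= lam, 0.0,
           np.where(u <= a * lam, (u - lam) ** 2 / (2 * (a - 1)),
                    lam * u - (a + 1) * lam ** 2 / 2))

def capl1_g(t, a=2.0):
    return np.abs(t) / a

def capl1_h(t, a=2.0):
    return np.maximum.reduce([np.zeros_like(t), t / a - 1, -t / a - 1])

t = np.linspace(-8, 8, 2001)
scad = scad_g(t) - scad_h(t)            # the folded-concave SCAD penalty
capl1 = capl1_g(t) - capl1_h(t)         # equals min(|t|/a, 1)
assert np.allclose(capl1, np.minimum(np.abs(t) / 2.0, 1.0))
```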

2.2 Stationary solutions

Due to the employment of the dc surrogates in (3), the SAA problem is non-convex and non-differentiable. Thus achieving globally, or even locally, optimal solutions by any given algorithm is not guaranteed without a priori knowledge. The d-stationary solutions, on the contrary, are desirable realistic stationary solutions: for certain dc programs with univariate sparsity functions, d-stationarity serves as a necessary condition for local optimality, i.e., there is no hope for a point to be a local optimum without being a d-stationary point. We define a point β as a d-stationary point of a function f if its directional derivative at β is nonnegative for every feasible direction d,

\[
f'(\beta; d) \,\triangleq\, \lim_{\tau \downarrow 0} \frac{f(\beta + \tau d) - f(\beta)}{\tau} \;\ge\; 0,
\]

as the positive scalar τ approaches 0. Existing deterministic and randomized algorithms compute d-stationary solutions for certain kinds of non-convex programs [24,23], of which problem (3) is a special case, showing that the concept is provably computable using practical algorithms. We note that d-stationarity implies criticality, while the two kinds of stationary solutions of a dc function are equivalent if the concave component of the function is differentiable. [See [15] for the definition of a critical point and the difference-of-convex algorithm which computes that kind of solution.] For the rest of this work, we let wN denote a d-stationary solution of problem (3), so that

\[
w^N \in \Big\{ w \;\Big|\; \nabla L_N(w)^T d + \lambda_N P'(w; d) \ge 0 \ \text{ for all } d \in \mathbb{R}^d \Big\}.
\]

Directional differentiability of the convex and non-convex sparsity surrogates mentioned in the current paper can be verified, and some of their explicit expressions are given in [10]. It has been shown that sparse SAA programs formulated with certain non-convex sparsity functions possess convexity determined by functional constants of the model, loss, and sparsity functions, and by the weight parameter λN [2, Proposition 3.1]. Thus d-stationary solutions of such problems achieve global optimality.
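As an illustration of the definition (and not of the algorithms of [24,23]), the sketch below probes approximate d-stationarity of a candidate point by sampling random unit directions and checking the sign of a one-sided finite-difference estimate of the directional derivative of LN(w) + λN P(w); the least-squares loss, capped ℓ1 penalty, and all numerical values are hypothetical.

```python
import numpy as np

# Probe d-stationarity of a candidate point w: sample unit directions d and check
# that the one-sided directional derivative of F = L_N + lambda_N * P is >= 0.
rng = np.random.default_rng(0)
N, dim, lam, a = 50, 10, 0.1, 2.0
X, y = rng.standard_normal((N, dim)), rng.standard_normal(N)

def F(w):
    loss = 0.5 / N * np.sum((X @ w - y) ** 2)        # least-squares L_N(w)
    capl1 = np.minimum(np.abs(w) / a, 1.0).sum()     # capped ell_1 penalty P(w)
    return loss + lam * capl1

def looks_d_stationary(w, n_dirs=2000, tau=1e-6, tol=1e-6):
    for _ in range(n_dirs):
        d = rng.standard_normal(dim)
        d /= np.linalg.norm(d)
        if (F(w + tau * d) - F(w)) / tau < -tol:     # found a descent direction
            return False
    return True

print(looks_d_stationary(np.zeros(dim)))             # checks the origin as a candidate
```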

To analyze the connection between the empirical problem (3) and the theoretical problem (1), we need to choose a vector of interest which is related to the latter problem. The expectation minimization problem is convex and differentiable by the assumption made on the composite function ℓ(m(w, x), y). Hence, if we had access to the data distribution, a local minimum of the problem would be a global minimizer, provided that the function is bounded below. Despite this desirable property, such an optimum is hardly achieved in practice. Instead of developing theory based on such an unattainable kind of solution, we introduce the vector w∗ which satisfies the following assumption.

(AN). For a positive scalar ε, let w∗ satisfy ‖∇LN(w∗)‖∞ ≤ ε for the given sample size N.

We point out that w∗ can be an identifiable point; in particular, the global minimum of the sample average approximation (without regularization) satisfies the assumption. The assumption is motivated by a probabilistic bound provided in [25, Theorem 7.77]. The theorem states that, under suitable conditions, for any ε′ > 0 there exist positive constants α and β, independent of N, such that

\[
\mathbb{P}\Big( \sup_{\| d \| = 1} \big| \nabla L_N(w)^T d - \nabla L(w)^T d \big| \le \varepsilon' \Big) \;>\; 1 - \alpha\, e^{-N \beta}
\]

for any given vector w. If a unique ground truth were to exist for the expectation minimization problem, the gradient of L at that solution would be exactly zero; hence the underlying global optimum satisfies assumption (AN) with high probability. [The probability approaches 1 as N goes to infinity.]
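For the least-squares case, assumption (AN) can be checked directly; the sketch below (synthetic data, hypothetical choice of ε) evaluates ‖∇LN(w∗)‖∞ both at the unregularized SAA minimizer, where it is essentially zero, and at a sparse ground-truth vector, where it reflects the noise level.

```python
import numpy as np

# Check assumption (A_N): ||grad L_N(w_star)||_inf <= eps for candidate vectors w_star.
rng = np.random.default_rng(1)
N, dim, eps = 200, 20, 0.05
w_true = np.zeros(dim); w_true[:3] = [1.0, -2.0, 0.5]       # sparse ground truth
X = rng.standard_normal((N, dim))
y = X @ w_true + 0.1 * rng.standard_normal(N)

def grad_LN(w):
    return X.T @ (X @ w - y) / N        # gradient of L_N(w) = ||Xw - y||_2^2 / (2N)

w_saa = np.linalg.lstsq(X, y, rcond=None)[0]                # unregularized SAA minimizer
print(np.linalg.norm(grad_LN(w_saa), ord=np.inf) <= eps)    # (A_N) holds at the SAA minimizer
print(np.linalg.norm(grad_LN(w_true), ord=np.inf))          # ~ ||X^T noise||_inf / N at the ground truth
```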

2.3 The LASSO formulation and existing results

As mentioned, this work is inspired by an existing work on a special case of (3), and we provide a brief summary of the problem setting and the results given in [13, Chapter 11] for convenient reference. Note that we refer to the above reference for any LASSO settings and results to be mentioned hereafter. We formally define the LASSO problem as

\[
\operatorname*{minimize}_{w \in \mathbb{R}^d} \; \frac{1}{2N} \, \| Y - X w \|_2^2 \;+\; \lambda^{\mathrm{Lasso}}_N \, \| w \|_1 \tag{5}
\]

where each row of X ∈ R^{N×d} and each component of Y ∈ R^N are the feature information vector xᵀ and its outcome y, respectively. The problem is a special case of the sparse learning program (3) where ℓ is the quadratic least squares error, m is the linear model m(w, x) = wᵀx, and P is the ℓ1-norm. Let us denote by wLasso the optimal solution of problem (5). In the referenced chapter, the empirical solution is compared to the underlying ground truth w0; the authors assume that the samples are generated from w0 and then perturbed by random Gaussian noise ε ∈ R^N, i.e., Y = Xw0 + ε where each εi ∼ N(0, σ²) for 1 ≤ i ≤ N. Therein, bounds for the distance between the two solutions and between their model predictions, measured by the ℓ2-norm ‖u‖2 = ( ∑_{i=1}^n u_i² )^{1/2} for u ∈ R^n, are provided. The support recovery of wLasso, namely that it achieves exactly the same support as w0, denoted S0 ≜ { i | w0_i ≠ 0 }, is shown to hold with high probability. We refer to the appendix for a full summary of that work.

3 The neighborhood of strong convexity

In general, strong convexity is not expected for sparse learning problems even if they are formulated only with convex functions. One example is the least squares error minimization in high-dimensional linear regression, i.e., minimize_x ½‖Ax − b‖₂² for a given vector b and matrix A. If the matrix's number of columns exceeds its number of rows, the Hessian of the function, AᵀA, is rank deficient and strong convexity is absent even though the problem is convex. In light of this property, an assumption of restricted strong convexity for the least squares error function within a neighborhood has been imposed in existing LASSO analyses, including the reference [13]. We introduce an analogous assumption for the composite of general model and loss functions with respect to a region derived from the property of d-stationary solutions.

3.1 Some assumptions

For the regularization parameter λN and the sparsity function P , we assume:

(Aλ). For any scalar 0 < δ < 1, let

\[
\delta\, \lambda_N \;\ge\; \frac{\varepsilon}{\min_{1 \le j \le d} c_j};
\]

(As). For any jth component of w, the sparsity function p_j satisfies

\[
\sup_{w_j} \big| h'_{jk}(w_j) \big| \;\le\; c_j \quad \text{where } h'_{jk}(w_j) = \frac{d\, h_{jk}(w_j)}{d w_j}, \ \text{ for all } 1 \le k \le J_j.
\]

To satisfy the assumption, the value of λN can be assigned as follows: given ε, we select δ between 0 and 1 and then choose λN according to the inequality. The one-sided inequality in (Aλ) allows a wide range of values to be selected for the parameter. Nonetheless, a smaller λN is desired in order to obtain meaningful results, as the parameter is one of the factors that control the tightness of the bounds to be presented.

The assumption (As) determines the qualification of the sparsity functions to be involved in formulation (3). It can be interpreted as requiring that, for each of the univariate surrogates, the curvature of the convex component dominates the curvature of the concave component uniformly over all w. While such a condition may seem restrictive, we investigate the sparsity functions listed in Table 1 to verify the applicability of the assumption. For the univariate dc functions of the form p(t) = c|t| − max_{1≤k≤J} h_k(t) for t ∈ R, we identify the coefficient of the convex term and the derivative(s) of the concave term. The constants a, λ and σ′ are parameters given by the definition of each sparsity surrogate.

– Capped ℓ1, with three maximands in the concave term (J = 3):
\[
c = \frac{1}{a}, \qquad h^{\mathrm{capL1}\,\prime}_{a}(t) \in \Big\{ 0, \; -\frac{1}{a}, \; \frac{1}{a} \Big\};
\]

– MCP (J = 1):
\[
c = 2\lambda, \qquad h^{\mathrm{MCP}\,\prime}_{a,\lambda}(t) = \begin{cases} \dfrac{2t}{a} & \text{if } |t| \le a\lambda, \\[4pt] 2\lambda\, \mathrm{sign}(t) & \text{if } |t| \ge a\lambda; \end{cases}
\]

– SCAD (J = 1):
\[
c = \lambda, \qquad h^{\mathrm{SCAD}\,\prime}_{a,\lambda}(t) = \begin{cases} 0 & \text{if } |t| \le \lambda, \\[2pt] \dfrac{|t| - \lambda}{a-1}\, \mathrm{sign}(t) & \text{if } \lambda \le |t| \le a\lambda, \\[4pt] \lambda\, \mathrm{sign}(t) & \text{if } |t| \ge a\lambda. \end{cases}
\]

For the above functions, we verify that sup_t max_{1≤k≤J} |h′_k(t)| = c. The assumption is also satisfied by the single-piece sparsity surrogates below.

– Logarithmic (J = 1):
\[
c = \frac{\lambda}{\sigma'}, \qquad h^{\mathrm{Log}\,\prime}_{\lambda;\sigma'}(t) = \frac{\lambda\, t}{\sigma' ( \sigma' + |t| )};
\]

– Transformed ℓ1 (J = 1):
\[
c = \frac{a+1}{a}, \qquad h^{\mathrm{TL1}\,\prime}_{a}(t) = \left[ \frac{a+1}{a} - \frac{a\,(a+1)}{( a + |t| )^2} \right] \mathrm{sign}(t).
\]

It is trivial that sup_t |h′(t)| is strictly less than c for the last two functions, validating that the assumption is suitable for all functions listed in Table 1.
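The verification above can also be spot-checked numerically; the snippet below (illustrative parameter values) confirms on a grid that sup_t max_k |h′_k(t)| equals c for the MCP and SCAD decompositions.

```python
import numpy as np

# Numerical spot-check of (A_s): sup_t max_k |h'_k(t)| <= c for MCP and SCAD.
a, lam = 3.7, 0.5
t = np.linspace(-10, 10, 100001)

def h_prime_mcp(t):
    return np.where(np.abs(t) <= a * lam, 2 * t / a, 2 * lam * np.sign(t))

def h_prime_scad(t):
    u = np.abs(t)
    return np.sign(t) * np.where(u <= lam, 0.0,
                       np.where(u <= a * lam, (u - lam) / (a - 1), lam))

print(np.abs(h_prime_mcp(t)).max(), "<= c =", 2 * lam)   # c = 2*lambda for MCP
print(np.abs(h_prime_scad(t)).max(), "<= c =", lam)      # c = lambda for SCAD
```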

3.2 Derivation of the region

We derive a region using the property of directional stationary solutions. The region contains all vectors (w∗ − wN) where wN is a d-stationary solution of problem (3). The set of described vectors defines the strong convexity assumption to be introduced at the end of this section. By definition of a d-stationary solution, wN satisfies L′N(wN; w − wN) + λN P′(wN; w − wN) ≥ 0. Since LN is differentiable by (A0), the inequality can be written as

\[
0 \;\le\; \Big( \underbrace{\frac{1}{N} \sum_{s=1}^{N} \ell'\big( m(w^N, x^s), y^s \big)\, \nabla m(w^N, x^s)}_{\nabla L_N(w^N)} \Big)^{T} (w^* - w^N) \;+\; \lambda_N\, P'( w^N ;\, w^* - w^N ) \tag{6}
\]

with the substitution w = w∗. Due to the convexity of the composite term, we have

\[
\nabla L_N(w^N)^T ( w^* - w^N ) \;\le\; \nabla L_N(w^*)^T ( w^* - w^N ) \tag{7}
\]

by the gradient inequality. By (A0), the directional derivative of the sparsity function can be written as a difference of two directional derivatives:

\[
P'(w^N;\, w^* - w^N) \;=\; g'(w^N;\, w^* - w^N) - h'(w^N;\, w^* - w^N)
\;\le\; g(w^*) - g(w^N) - h'(w^N;\, w^* - w^N) \tag{8}
\]

where the inequality is given by the convexity of g. Each function h_j consists of the maximum of finitely many smooth and convex univariate functions, hence we define a set of active indices to identify the directional derivative. For a given component j, let J^N_j be the set of active indices, defined as J^N_j ≜ { k′ | h_{jk′}(w^N_j) = h_j(w^N_j), 1 ≤ k′ ≤ J_j }. The directional derivative of h_j at w^N_j in the direction d_j is equal to max_{k ∈ J^N_j} h′_{jk}(w^N_j; d_j) by Danskin's theorem.

Denote by S the support of w∗, i.e., S ≜ { i | w∗_i ≠ 0, for i = 1, …, d }. We then rewrite P′(wN; w∗ − wN) as a sum of univariate functions with respect to the set S and its complement. Continuing from (8),

\[
\begin{aligned}
&= -\sum_{j \notin S} \Big[ c_j\, |w^N_j| + \max_{k \in J^N_j} \big\{ -h'_{jk}(w^N_j)\, w^N_j \big\} \Big]
+ \sum_{j \in S} \Big[ c_j\, |w^*_j| - c_j\, |w^N_j| - \max_{k \in J^N_j} h'_{jk}(w^N_j)\,(w^*_j - w^N_j) \Big] \qquad (9) \\
&\le -\sum_{j \notin S} \Big[ c_j + \max_{k \in J^N_j} \big( -|h'_{jk}(w^N_j)| \big) \Big] |w^N_j|
+ \sum_{j \in S} \Big[ c_j + \min_{k \in J^N_j} |h'_{jk}(w^N_j)| \Big] |w^*_j - w^N_j|.
\end{aligned}
\]

Applying the above derivations, the d-stationarity property given in (6) reduces to

\[
\begin{aligned}
\sum_{j \notin S} \Big[ c_j + \max_{k \in J^N_j} \big( -|h'_{jk}(w^N_j)| \big) \Big] |w^N_j|
&\le \sum_{j \in S} \Big[ c_j + \min_{k \in J^N_j} |h'_{jk}(w^N_j)| \Big] |w^*_j - w^N_j|
+ \frac{1}{\lambda_N} \nabla L_N(w^*)^T (w^* - w^N) \\
&\le \sum_{j \in S} \Big[ c_j + \min_{k \in J^N_j} |h'_{jk}(w^N_j)| \Big] |w^*_j - w^N_j|
+ \frac{1}{\lambda_N} \| \nabla L_N(w^*) \|_\infty \, \| w^* - w^N \|_1
\end{aligned} \tag{10}
\]

where the latter inequality is obtained by applying Hölder's inequality. Now suppose that, for some 0 < δ < δ′ < 1, we have

\[
\max_{k \in J^N_j} \big( -|h'_{jk}(w^N_j)| \big) \;\ge\; -c_j\,(\delta' - \delta)
\]

for every index j that is not in S. Then, by the assumptions stated thus far, namely (AN), (Aλ) and (As), inequality (10) reduces to

\[
\sum_{j \notin S} (1 - \delta')\, c_j\, |w^N_j| \;\le\; \sum_{j \in S} (2 + \delta)\, c_j\, |w^*_j - w^N_j|.
\]

3.3 Definition of the region

Summarizing the above derivation, we define the region on which the assumption of restricted strong convexity is to be imposed. For the given support set S of w∗, let δ and δ′ be scalars that satisfy 0 < δ < δ′ < 1, and define the set

\[
V_{\delta,\delta'} \;\triangleq\; \left\{ v \in \mathbb{R}^d \;\middle|\; \sum_{j \notin S} c_j |v_j| \;\le\; \frac{2+\delta}{1-\delta'} \sum_{j \in S} c_j |v_j| \right\}. \tag{11}
\]

It is clear that, for any fixed constants δ and δ′, the above set is a cone; i.e., for α > 0, we have αu ∈ V_{δ,δ′} for any u ∈ V_{δ,δ′}. Specifically, it is a non-convex cone, which can be shown by a simple 2-dimensional counterexample: let c₁ and c₂ equal 1, and S = {2}; we verify that u = (1, 1) and u′ = (7, −1) are both contained in V_{0.2, 0.7}, yet the convex combination of the two vectors, αu + (1 − α)u′ for α = 0.1, is not contained in the set. This violates the definition of a convex set.
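The membership test in (11) and the counterexample above are easy to verify numerically; the following sketch (with c₁ = c₂ = 1 and S = {2}, using 0-based indexing in the code) checks the three points involved.

```python
import numpy as np

# Check membership in V_{delta, delta'} per (11), and the non-convexity counterexample.
def in_V(v, S, delta, delta_p, c=None):
    c = np.ones_like(v) if c is None else c
    S = list(S)
    notS = [j for j in range(len(v)) if j not in S]
    lhs = np.sum(c[notS] * np.abs(v[notS]))
    rhs = (2 + delta) / (1 - delta_p) * np.sum(c[S] * np.abs(v[S]))
    return lhs <= rhs

S = {1}                                      # component 2 of the text, 0-based here
u, u2 = np.array([1.0, 1.0]), np.array([7.0, -1.0])
mid = 0.1 * u + 0.9 * u2                     # convex combination with alpha = 0.1
print(in_V(u, S, 0.2, 0.7), in_V(u2, S, 0.2, 0.7), in_V(mid, S, 0.2, 0.7))
# expected output: True True False, so V_{0.2, 0.7} is not convex
```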

In the referenced LASSO analysis, a neighborhood of strong convexity is obtained based on the support of the ground truth and the optimality of the solution of the sparse SAA problem. The region provided therein is defined as

\[
V^{\mathrm{Lasso}} \;\triangleq\; \left\{ v \in \mathbb{R}^d \;\middle|\; \sum_{j \notin S^0} |v_j| \;\le\; 3 \sum_{j \in S^0} |v_j| \right\}.
\]

It is trivial that the two sets differ only by the coefficient appearing on the right side of the inequalities if w∗ = w0. The inclusion relationships among the given sets are determined by the value of the quotient in (11), such that if

• (2 + δ)/(1 − δ′) < 3, then V_{δ,δ′} ⊂ V^{Lasso};
• δ = 1 − 3δ′, then the two sets are equivalent;
• (2 + δ)/(1 − δ′) > 3, then V_{δ,δ′} ⊃ V^{Lasso}.

The above set inclusions with respect to δ and δ′ are illustrated in part (a) of Figure 1. In the δ-δ′ plane, the collection of pairs feasible for V_{δ,δ′} is shown as a shaded region, and the set of pairs that corresponds to V^{Lasso} is shown as a dashed line segment. For the case δ′ = 2δ, we illustrate three 2-dimensional examples of the region V_{δ,2δ} to show that it is possible to choose values of the parameters such that V_{δ,2δ} contains (or is contained in) V^{Lasso}. In Figure 1 (b), we compare the LASSO region, which corresponds to δ = 1/7, with V_{1/3, 2/3} and V_{1/100, 2/100}, and show that the former region (for δ = 1/3) is a superset and the latter region (for δ = 1/100) is a subset of V^{Lasso}.

[Figure 1]
Fig. 1: (a) Division of the δ′-δ space according to the relationship of V_{δ,δ′} (shaded) and V^{Lasso} (dashed line); (b) examples of V_{δ,2δ} for δ = 1/100 (dashed line), δ = 1/7 (LASSO, solid line), and δ = 1/3 (no line); S = {2}

3.4 Restricted strong convexity assumption

Based on the derived region, we state the assumption of restricted strong convexity on the sample average approximation term LN(w):

(ARSC). There exists γℓmin > 0 such that

\[
\gamma^{\ell}_{\min} \, \| w^* - w \|_2^2 \;\le\; L_N(w^*) - L_N(w) - \nabla L_N(w)^T ( w^* - w )
\]

for all w for which (w∗ − w) ∈ V_{δ,δ′}.

We recall that the set V_{δ,δ′} contains all vectors which are the difference between the vector w∗ and the d-stationary solutions of problem (3). Thus (ARSC) indicates that the sample average of the loss and model composite functions is strongly convex in the neighborhood of the vector w∗ determined by V_{δ,δ′}. [An analogous assumption imposed for LASSO is given in (29) in the appendix.]
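For intuition about (ARSC) in the least-squares case, where the right-hand side equals ‖X(w∗ − w)‖₂²/(2N), the following rough Monte-Carlo sketch (synthetic data, c_j = 1, arbitrary δ and δ′) samples unit directions inside the cone V_{δ,δ′} and reports the smallest curvature seen; it is a probe of the restricted constant, not a certified bound and not a procedure from the paper.

```python
import numpy as np

# Sample unit directions v in V_{delta,delta'} (with c_j = 1) and record the
# least-squares curvature v^T (X^T X) v / (2N) along each sampled direction.
rng = np.random.default_rng(2)
N, dim = 100, 40
X = rng.standard_normal((N, dim))
S = np.array([0, 1, 2])
notS = np.setdiff1d(np.arange(dim), S)
ratio = (2 + 0.2) / (1 - 0.7)               # (2 + delta)/(1 - delta') with delta=0.2, delta'=0.7

def sample_in_cone():
    v = rng.standard_normal(dim)
    budget = rng.uniform() * ratio * np.abs(v[S]).sum()
    v[notS] *= budget / np.abs(v[notS]).sum()   # force the cone inequality of (11)
    return v / np.linalg.norm(v)

curvatures = []
for _ in range(20000):
    v = sample_in_cone()
    curvatures.append(v @ (X.T @ X) @ v / (2 * N))
print("smallest curvature over sampled cone directions:", min(curvatures))
```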

4 Properties of the D-stationary Solution

We provide bounds on the d-stationary solutions and their model outcomes in comparison to w∗. If w∗ is the optimal solution of the SAA problem (2), the effects of regularization can be estimated by the bounds. If w∗ is the underlying ground truth of the expectation minimization problem, then the results serve as measures of how the empirically achieved solutions and their model outcomes deviate from the hidden optimal solution(s). The support recovery result shows under which circumstances wN recovers the support of w∗.


4.1 Bound on the distance w∗ − wN

The following theorem bounds the distance between wN and w∗ in terms of the ℓ2-norm.

Theorem 1 Let assumptions (A0), (AN), (Aλ), (As), and (ARSC) hold. Let wN be a directional stationary solution of (3). If, for some 0 < δ < δ′ < 1, it holds that

\[
\max_{k \in J^N_j} \big( -|h'_{jk}(w^N_j)| \big) \;\ge\; -c_j\, ( \delta' - \delta ) \quad \text{for every } j \notin S \tag{12}
\]

where S is the support of w∗, then

\[
\| w^* - w^N \|_2 \;\le\; \frac{3}{\gamma^{\ell}_{\min}} \, \lambda_N \max_{1 \le j \le d} c_j \, \sqrt{|S|}\,. \tag{13}
\]

Proof. We start from the restricted strong convexity assumption (ARSC):

\[
\begin{aligned}
\gamma^{\ell}_{\min} \| w^* - w^N \|_2^2
&\le L_N(w^*) - L_N(w^N) - \nabla L_N(w^N)^T (w^* - w^N)
&& \text{since } (w^* - w^N) \in V_{\delta,\delta'} \\
&= L_N(w^*) - L_N(w^N) + \lambda_N P'(w^N;\, w^* - w^N)
- \nabla L_N(w^N)^T (w^* - w^N) - \lambda_N P'(w^N;\, w^* - w^N).
\end{aligned}
\]

By the convexity of LN and the d-stationarity of wN, the above reduces to

\[
\begin{aligned}
&\le \nabla L_N(w^*)^T (w^* - w^N) + \lambda_N P'(w^N;\, w^* - w^N) \\
&\le \sum_{j \notin S} \Big[ \varepsilon - \lambda_N c_j - \lambda_N \max_{k \in J^N_j} \big( -|h'_{jk}(w^N_j)| \big) \Big] \big| w^N_j \big|
+ \sum_{j \in S} \Big[ \varepsilon + \lambda_N c_j + \lambda_N \min_{k \in J^N_j} |h'_{jk}(w^N_j)| \Big] \big| w^*_j - w^N_j \big|
\end{aligned}
\]

by inequality (10) and assumption (AN). We verify that the first term in the last expression is strictly negative by assumptions (Aλ) and (As), and by the condition of the theorem. Therefore the above continues as

\[
\begin{aligned}
&\le 3\, \lambda_N \max_{1 \le j \le d} c_j \sum_{j \in S} \big| w^*_j - w^N_j \big| \\
&\le 3\, \lambda_N \max_{1 \le j \le d} c_j\, \sqrt{|S|}\; \| w^* - w^N \|_2 \tag{14}
\end{aligned}
\]

where the last inequality is obtained by applying the properties of the vector norm.

The condition of the theorem requires wNj to satisfy a strict inequality, min_{k∈J^N_j} |h′_{jk}(w^N_j)| < c_j, for the components that are not in the support of w∗. This condition is automatically satisfied if a weighted ℓ1-norm is employed as the sparsity function, since h(w) = 0 in that case. To further understand the condition, we examine piecewise dc sparsity functions. We verify that the condition only holds if |wNj| < aλ for both SCAD and MCP, where a and λ are the positive parameters of these functions. For the piecewise linear capped ℓ1, |wNj| ≤ 1/a is the necessary and sufficient condition for (12) to hold. Thus, for the above piecewise ℓ0-surrogates, the condition can be interpreted as requiring the value of wNj to be near the origin whenever the corresponding component of w∗ is at zero. In the discussion provided in Section 3, we verified that the logarithmic and transformed ℓ1 functions always yield a strict inequality regardless of the value of the input.

We now examine the components on the right side of the bound (13). The strong convexity modulus γℓmin is inversely proportional to the bound, indicating that, assuming γℓmin > 1, if the neighborhood of w∗ is more strongly convex then the achieved bound is tighter. If the vector w∗ is the optimal solution of the expectation minimization problem (1), the quantity ‖∇LN(w∗)‖∞ estimates the error caused by the sample average approximation, since ∇L(w∗) = 0. Such a measurement is expected to appear in a bound that compares the underlying solution with a solution obtained by solving an approximated problem. In the provided bound, the quantity λN max_{1≤j≤d} cj implicitly involves this error, since the quantity is strictly greater than ‖∇LN(w∗)‖∞ by assumptions (AN) and (Aλ). For the case where the vector w∗ is the optimal solution of the sample average approximation (2), assumption (Aλ) reduces to δλN ≥ 0; thus it is always possible to choose λN such that the bound (13) is simplified.
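To see the flavor of Theorem 1 on a concrete instance, the sketch below uses the ℓ1 special case (c_j = 1, h ≡ 0), where the global LASSO solution is a d-stationary point of (3); the strong convexity constant is replaced by the crude surrogate λ_min(XᵀX)/(2N), which is not the paper's restricted constant, and all data and parameter values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Compare ||w* - wN||_2 with the right-hand side of (13) on an ell_1 instance.
rng = np.random.default_rng(3)
N, dim, lam = 200, 20, 0.1
w_star = np.zeros(dim); w_star[:4] = [2.0, -1.5, 1.0, 0.5]
X = rng.standard_normal((N, dim))
y = X @ w_star + 0.1 * rng.standard_normal(N)

# sklearn's Lasso minimizes ||y - Xw||^2 / (2N) + alpha * ||w||_1, matching (5)
wN = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
gamma = np.linalg.eigvalsh(X.T @ X / N).min() / 2    # crude curvature surrogate for gamma_min
bound = 3.0 / gamma * lam * 1.0 * np.sqrt(np.count_nonzero(w_star))   # c_max = 1
print(np.linalg.norm(w_star - wN), "<=", bound)
```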

4.1.1 Comparison with LASSO consistency bound

The bound (13) achieves the LASSO consistency bound [13, Theorem 11.1(b)] with cj = 1 for all j. See (30) in the appendix for the exact statement of the theorem. The contribution of our result includes extending the special problem to a formulation consisting of a general convex and differentiable model and loss composite, and dc sparsity functions. Our bound is derived based on the stationary solution of the non-convex program without requiring unrealistic properties such as optimality of the stationary solution. Since condition (12) of Theorem 1 is not needed for LASSO, it remains to compare the assumptions imposed on the regularization parameters λLassoN and λN. In the LASSO consistency result, it is required that

\[
\lambda^{\mathrm{Lasso}}_N \;\ge\; \frac{2\, \| X^T \varepsilon \|_\infty}{N}
\]

where ε = Y − Xw0 is a Gaussian noise vector. By applying the LASSO settings to our problem and letting w∗ = w0, we have

\[
\| \nabla L_N(w^*) \|_\infty \;=\; \Big\| \frac{1}{N}\, X^T ( Y - X w^* ) \Big\|_\infty \;=\; \frac{1}{N} \| X^T \varepsilon \|_\infty.
\]

By assumptions (AN) and (Aλ), we have

\[
\delta\, \lambda_N \;\ge\; \varepsilon \;\ge\; \| \nabla L_N(w^*) \|_\infty, \tag{15}
\]

thus the condition on λLassoN is a special case of (15) where δ is selected as 1/2. Hence we show that Theorem 1 is a complete generalization of the LASSO consistency bound.


4.1.2 The case where p is linear near the origin

For a dc univariate surrogate that is defined by linear functions near the origin, the decomposition given in (4) implies that the slope of the concave component is zero if the input is near the origin. For such functions, let ξ denote a positive scalar such that h′(t) = 0 for all |t| ≤ ξ. Among the functions listed in Table 1, the capped ℓ1 and SCAD have the described property, with ξ = 1/a and ξ = λ for given parameters a and λ, respectively. We present a specialized result of the bound (13).

Corollary 1 Let assumptions (A0), (AN), (Aλ), (As) hold. Moreover, let assumption (ARSC) hold with respect to some 0 < δ < 1. Let wN be a directional stationary solution of a special case of problem (3) where P ≜ ∑_{j=1}^d p(w_j) satisfies h′(w_j) = 0 for all |w_j| ≤ ξ for some positive ξ. If

\[
| w^N_j | \;\le\; \xi \quad \text{for every } j \notin S \tag{16}
\]

where S is the support of w∗, then

\[
\| w^* - w^N \|_2 \;\le\; \frac{2 + \delta}{\gamma^{\ell}_{\min}} \, \lambda_N \max_{1 \le j \le d} c_j \, \sqrt{|S|}\,. \tag{17}
\]

Note that (16) implies the condition given in Theorem 1. Thus we relax the requirement on the parameters δ and δ′, which define the region of restricted strong convexity, to a condition on the stationary solution. Clearly, the relaxation introduces freedom in choosing the values of these parameters. Hence we achieve a tighter bound, applicable for the capped ℓ1 and SCAD, for the specialized problem, while imposing the strong convexity assumption over any desired V_{δ,δ′}.

4.2 Bounds on the prediction error

Through SAAs, statistical models are trained in such a way that their outcomes yield minimal error with respect to the response values in the given set of samples. While such models may achieve a desired level of accuracy for the training data, e.g., by overfitting, it is unclear whether the models would continue to show superior performance when new (unobserved) data points arrive. To evaluate the prediction ability of a model trained through SAA, we compare the model outcome with another set of predictions given by w∗. We examine the quantity m(w∗, •) − m(wN, •), referred to as the prediction error, and derive a bound on the difference between the model outcomes generated by wN and w∗. The described quantity is only meaningful for regression problems, since an outcome for a given point predicted by a classification model is typically a probability of the point belonging to a certain class. Thus the scope of the prediction error bounds presented in the current section remains within the regression setting.

We recall that (A0) assumes the composite function ℓ(m(w, x), y) is smooth and convex for any given (x, y). The assumption does not directly reveal any properties of the model function. As exploiting properties of a given model m is essential to derive a prediction bound, the challenge of developing such an analysis involves deciding what kind of structure one assumes for the function. Before imposing any further assumptions, we examine the following special case.

4.2.1 The case of least squares linear regression

Consider the sparse learning problem (3) where the loss and model composite is the least squares error of a linear model, given by

\[
L^{\mathrm{lse}}_N(w) \;\triangleq\; \frac{1}{2N} \sum_{s=1}^{N} ( w^T x^s - y^s )^2 \;=\; \frac{1}{2N} \| X w - Y \|_2^2. \tag{18}
\]

Let w∗ satisfy the LASSO assumption, namely that the samples are generated by w∗ and then perturbed by a random noise vector ε ∈ R^N following any distribution. The restricted strong convexity assumption for the approximation term (18) then yields

\[
\begin{aligned}
\gamma^{\ell}_{\min} \| w^* - w^N \|_2^2
&\le L^{\mathrm{lse}}_N(w^*) - L^{\mathrm{lse}}_N(w^N) - \nabla L^{\mathrm{lse}}_N(w^N)^T (w^* - w^N)
&& \text{for any d-stationary solution } w^N \\
&= \frac{1}{2N} \| X w^* - Y \|_2^2 - \frac{1}{2N} \| X w^N - Y \|_2^2
- \frac{1}{N} (w^* - w^N)^T X^T ( X w^N - Y ) \\
&= \frac{1}{N} \| X ( w^* - w^N ) \|_2^2 && \text{since } Y = X w^* + \varepsilon \\
&\le 3\, \lambda_N \max_{1 \le j \le d} c_j \, \sqrt{|S|}\; \| w^* - w^N \|_2 \tag{19}
\end{aligned}
\]

where the last inequality is given by (14). The linearity of the model plays a key role in simplifying the right side expression in the first line to a term involving the quantity Xw∗ − XwN, which measures the prediction error between the two models. With the above derivation, we present the prediction error bound for this special case of problem (3).

Corollary 2 Let Y = Xw∗ + ε where ε is a random noise vector, and let assumptions (A0), (AN), (Aλ), (As), and (ARSC) hold. Let wN be a d-stationary solution of problem (3) where LN(w) is of the form (18). If condition (12) holds, then

\[
\frac{1}{N} \| X ( w^N - w^* ) \|_2^2 \;\le\; \frac{9}{\gamma^{\ell}_{\min}} \, \lambda_N^2 \Big( \max_{1 \le j \le d} c_j \Big)^2 |S|. \tag{20}
\]

Proof. From inequality (19), it follows that

\[
\Big( \frac{1}{N} \| X ( w^* - w^N ) \|_2^2 \Big)^2
\;\le\; \Big( 3 \lambda_N \max_{1 \le j \le d} c_j \sqrt{|S|} \Big)^2 \| w^* - w^N \|_2^2
\;\le\; \Big( 3 \lambda_N \max_{1 \le j \le d} c_j \sqrt{|S|} \Big)^2 \frac{1}{\gamma^{\ell}_{\min} N} \| X ( w^* - w^N ) \|_2^2.
\]


Remark. The result generalizes the LASSO prediction error bound [13, Theorem 11.2] to a bound on the d-stationary solutions of the non-convex least squares error SAA problem involving dc sparsity functions.

A natural question is whether it is possible for general nonlinear models to achieve the same bound and thus show a complete generalization of the LASSO prediction error bound. The key step in achieving the above result involves examining the right side expression of the restricted strong convexity assumption by explicitly expressing the terms using the definition of the model function. We inspect the expression for the special case where a general m is combined with a quadratic loss, to show the importance of the model structure:

\[
\begin{aligned}
& L_N(w^*) - L_N(w^N) - \nabla L_N(w^N)^T (w^* - w^N) \\
&= \frac{1}{N} \sum_{s=1}^{N} \big[ m(w^*, x^s) - m(w^N, x^s) \big]^2
- \frac{2}{N} \sum_{s=1}^{N} \big( y^s - m(w^N, x^s) \big) \big[ m(w^*, x^s) - m(w^N, x^s) - \nabla m(w^N, x^s)^T (w^* - w^N) \big].
\end{aligned} \tag{21}
\]

The second term in (21) appears to be redundant, as the prediction error is represented by the first term only. The example clearly shows that, even with a limited choice of loss function, the bound (20) is unlikely to be achieved without taking advantage of the linearity of the model. Additional conditions such as a bound on the training error or strong convexity of the model can be imposed to handle the undesired term. Nevertheless, such additional restrictions may introduce a trade-off of loosening the bound.

4.2.2 The case of Lipschitz model functions

Consider a univariate quadratic model m(w, t) = w t² for t, w ∈ R. Suppose we want to bound the prediction error between the models given by w′′ and w′ on two samples ta and tb. The error is quantified as

\[
\frac{1}{2} \Big[ ( w'' t_a^2 - w' t_a^2 ) + ( w'' t_b^2 - w' t_b^2 ) \Big] \;=\; \frac{1}{2} ( w'' - w' )( t_a^2 + t_b^2 ),
\]

and this quantity cannot be bounded above by a constant even if ‖w′′ − w′‖ is bounded. This simple example indicates that boundedness of the model is required in order to derive a prediction error bound. For the current section, we assume that the model is a Lipschitz function:

(Am). For any given sample xs, the model m : R^d × R^d → R satisfies

\[
| m(w^*, x^s) - m(w, x^s) | \;\le\; \mathrm{Lip}_{m_s} \, \| w^* - w \|_2 \quad \text{for some } \mathrm{Lip}_{m_s} > 0
\]

for all w such that (w∗ − w) ∈ V_{δ,δ′}.

By the property of Lipschitz functions, any function with a bounded gradient satisfies the assumption. When m is a linear model, the Lipschitz constant Lip_{m_s} is equal to ‖x^s‖2, showing that the constant may depend on the data points. With the above assumption, we present the prediction error bound for Lipschitz model functions.


Proposition 1 Let assumptions (A0), (AN), (Aλ), (As), (ARSC), and (Am) hold. If condition (12) holds, then any directional stationary solution wN of problem (3) satisfies

\[
\frac{1}{N} \sum_{s=1}^{N} \big| m(w^*, x^s) - m(w^N, x^s) \big| \;\le\; \frac{3}{\gamma^{\ell}_{\min}} \, \lambda_N \max_{1 \le j \le d} c_j \, \sqrt{|S|}\; L_m \tag{22}
\]

where L_m is defined as (1/N) ∑_{s=1}^N Lip_{m_s}.

The proof is straightforward and hence omitted. By assuming the Lipschitz property of the model, we derive the bound without requiring any further structure on the loss function. Assuming that the samples are independent and identically distributed, the constant L_m represents the sample expected value of the Lipschitz constant of the model function. For the case of a linear model, we verify that L_m is the average ℓ2-norm of all sample feature vectors.
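For a linear model the per-sample Lipschitz constant in (Am) is ‖x^s‖₂, so L_m is simply the average feature norm; the short sketch below (hypothetical data and constants) computes L_m and assembles the right-hand side of (22).

```python
import numpy as np

# L_m for a linear model m(w, x) = w^T x, and the right-hand side of (22).
rng = np.random.default_rng(4)
N, dim = 200, 20
X = rng.standard_normal((N, dim))

L_m = np.linalg.norm(X, axis=1).mean()               # (1/N) * sum_s ||x^s||_2

gamma_min, lam, c_max, S_size = 0.3, 0.1, 1.0, 4     # illustrative constants
rhs_22 = 3.0 / gamma_min * lam * c_max * np.sqrt(S_size) * L_m
print(L_m, rhs_22)
```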

4.3 Support recovery

To achieve complete support recovery of a given d-stationary solution wN with respect to the support of w∗, we need to show the following properties.

– No false inclusion: support(wN) ⊆ support(w∗);
– No false exclusion: support(wN) ⊇ support(w∗).

No false inclusion means that whenever w∗j = 0 for some component j, then wNj = 0 for the corresponding component. Conversely, no false exclusion means that whenever w∗j ≠ 0, then wNj ≠ 0. The following theorem presents the variable selection consistency of the d-stationary solution wN. Denote by wS the subvector of w that corresponds to the set S.

Theorem 2 Let assumptions (A0), (AN), (Aλ), (As), and (ARSC) hold. Let wN be a d-stationary solution of problem (3); then the following hold:

1. If it holds that

\[
\frac{1}{N \lambda_N} \sum_{s=1}^{N} \left| \frac{\partial\, \ell\big( m(w^N, x^s), y^s \big)}{\partial w_j} \right| \;+\; \min_{k \in J^N_j} |h'_{jk}(w^N_j)| \;<\; c_j \quad \text{for all } j \notin S \tag{23}
\]

then the support of wN is contained in S;

2. Let condition (12) be satisfied. If

\[
\underbrace{\Big[ \frac{3}{\gamma^{\ell}_{\min}} \, \lambda_N \max_{1 \le j \le d} c_j \, \sqrt{|S|} \Big]^2}_{B_2(\lambda_N,\, \gamma^{\ell}_{\min})} \;>\; \sum_{j \notin S} ( w^N_j )^2, \qquad \text{then} \qquad
\| w^N_S - w^*_S \|_\infty \;\le\; \underbrace{\sqrt{\, B_2(\lambda_N,\, \gamma^{\ell}_{\min}) - \sum_{j \notin S} ( w^N_j )^2 \,}}_{B},
\]

and thus if min_{j∈S} |w∗j| > B then S is contained in the support of wN.


Remark. If condition (23) holds with respect to a given subset of indices, then no false inclusion of wN holds for that set.

Proof. We show the first part by contradiction, starting from the definition of a d-stationary solution. Suppose (23) holds and the support of wN is not contained in S. By replacing P′(wN; w − wN) with the upper bound derived as in (8)–(9), the definition of a d-stationary solution yields

\[
\begin{aligned}
0 &\le \nabla L_N(w^N)^T ( w - w^N ) + \lambda_N \sum_{j=1}^{d} \Big[ c_j ( |w_j| - |w^N_j| ) - \max_{k \in J^N_j} h'_{jk}(w^N_j;\, w_j - w^N_j) \Big] \\
&= \frac{1}{N} \sum_{s=1}^{N} \sum_{j=1}^{d} \frac{\partial \ell\big( m(w^N, x^s), y^s \big)}{\partial w_j} ( w_j - w^N_j )
+ \lambda_N \sum_{j=1}^{d} \Big[ c_j ( |w_j| - |w^N_j| ) - \max_{k \in J^N_j} h'_{jk}(w^N_j;\, w_j - w^N_j) \Big]
\end{aligned} \tag{24}
\]

for any vector w. Let us choose w such that

\[
w_j = \begin{cases} w^N_j & \text{for } j \in S \\ 0 & \text{for } j \notin S. \end{cases}
\]

We continue from inequality (24) by substituting the components of w:

\[
\begin{aligned}
&= \sum_{j \notin S} \Big[ \frac{1}{N} \sum_{s=1}^{N} \frac{\partial \ell\big( m(w^N, x^s), y^s \big)}{\partial w_j} ( -w^N_j )
+ \lambda_N \Big( -c_j |w^N_j| - \max_{k \in J^N_j} h'_{jk}(w^N_j;\, -w^N_j) \Big) \Big] \\
&\le \sum_{j \notin S} \Big[ \frac{1}{N} \sum_{s=1}^{N} \Big| \frac{\partial \ell\big( m(w^N, x^s), y^s \big)}{\partial w_j} \Big| \, |w^N_j|
+ \lambda_N \Big( -c_j |w^N_j| + \min_{k \in J^N_j} |h'_{jk}(w^N_j)| \, |w^N_j| \Big) \Big].
\end{aligned}
\]

The above derivation shows that, for any jth component that is not in S, the corresponding summand is equal to

\[
\Big[ \frac{1}{N} \sum_{s=1}^{N} \Big| \frac{\partial\, \ell\big( m(w^N, x^s), y^s \big)}{\partial w_j} \Big|
+ \lambda_N \Big( -c_j + \min_{k \in J^N_j} |h'_{jk}(w^N_j)| \Big) \Big] \, | w^N_j |.
\]

Therefore, by (24) and condition (23), we must have wNj = 0 for all j ∉ S, showing that any index j corresponding to a nonzero-valued wNj must be in S. The second part follows from the bound (13) and properties of the vector norm.

As remarked, the no false inclusion property of a d-stationary solution can be shown for any given subset of indices. To be clear, the support of wN is known by the time we compute the solution, and support recovery with respect to a known set of indices is already verified at that point. In such a case, the first part of the theorem rather serves as a quantitative statement which formally expresses the structural properties of the problem needed to achieve such inclusion. We can interpret the inequality in (23) as being satisfied if ∇LN(wN) is sufficiently small, indicating that wN is close to w∗ by (AN), and if the values of |wNj| are near the origin whenever w∗j = 0 for j ∉ S; we discussed the latter interpretation in Section 4.1 for piecewise sparsity functions.
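Condition (23) is directly computable once a candidate stationary point is at hand; the sketch below evaluates it in the ℓ1 special case (h ≡ 0, c_j = 1) with the squared loss of (18), using a LASSO solution as the d-stationary point; the data and parameter values are hypothetical, and the printed booleans simply report whether the sufficient condition and the resulting inclusion hold on this instance.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Evaluate the no-false-inclusion condition (23) off the support of w*, for the
# ell_1 case (min_k |h'| = 0, c_j = 1) and squared loss: d ell/d w_j = (w^T x^s - y^s) x^s_j.
rng = np.random.default_rng(5)
N, dim, lam = 300, 30, 0.2
w_star = np.zeros(dim); w_star[0] = 2.0
X = rng.standard_normal((N, dim))
y = X @ w_star + 0.02 * rng.standard_normal(N)
S = np.flatnonzero(w_star)

wN = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_   # a d-stationary point of (3)
resid = X @ wN - y
lhs = np.abs(resid[:, None] * X).sum(axis=0) / (N * lam)     # first term of (23) per coordinate
off_support = np.setdiff1d(np.arange(dim), S)
print("condition (23) off the support:", bool((lhs[off_support] < 1.0).all()))
print("support(wN) contained in S:", set(np.flatnonzero(wN)) <= set(S))
```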

The above support recovery analysis takes an approach whose derivation and result are independent of the variable selection consistency analysis in [13, Theorem 11.3], which involves probabilistic statements. Also, we point out that the stationary solution wN may not be unique in general; moreover, it is possible for two d-stationary solutions of a problem to be supported on different sets of indices. Another important difference between our work and the existing work is that we provide a deterministic analysis for all the results shown thus far.

4.3.1 The case of least squares linear regression

To complete the comparison for the support recovery analysis, we consider the special case of problem (3) where the composite function is of the form (18). Furthermore, we assume that random noise is embedded in the samples and that the columns of the matrix X are normalized. This allows us to provide a probabilistic statement that is comparable to the variable selection consistency result for the LASSO problem.

Theorem 3 Let Y = Xw∗ + ε where εi ∼ N(0, σ²) for all i = 1, …, N, and let assumptions (A0), (AN), (Aλ), (As), and (ARSC) hold. In addition, let λN satisfy

\[
\lambda_N \;>\; \sqrt{ \frac{\log |S|}{N} }.
\]

Let wN be a d-stationary solution of problem (3), where LN(w) is of the form (18), that satisfies condition (23). Then, with probability greater than 1 − c₁ e^{−c₂ N λ²_N} for some positive scalars c₁ and c₂,

\[
\| w^N_S - w^*_S \|_\infty \;\le\; \underbrace{\lambda_N \sqrt{|S|} \left[ \sqrt{ \frac{2\sigma^2}{\gamma^{\ell}_{\min}} } \;+\; \frac{ \max_{1 \le j \le d} c_j }{ \gamma^{\ell}_{\min} } \right]}_{B}. \tag{25}
\]

Hence, if min_{j∈S} |w∗j| is strictly greater than B, the support of wN is consistent with S.

The proof of the theorem is given after the following discussion. Provided that the support of the d-stationary solution is contained in S, we are able to compare the achieved ℓ∞-bound with the existing result. We point out that the theorem does not impose the mutual incoherence assumption, shown in (33), and achieves a generalization of the existing bound with an additional factor of √|S|. This is due to the property of directional stationarity, which is defined through the sum of the directional derivatives at each component of a vector. Such a structure limits the ability to investigate individual summands; thus the ℓ∞-norm is derived from norms that contain all components of w∗ − wN.

By involving the distribution of the random noise, we achieve a bound quantified by the standard deviation of the random variable. We observe that, for any σ sufficiently small, the above result achieves a bound tighter than the bound given in Theorem 2 with high probability, provided that condition (23) is satisfied. Additionally, we verify that the above holds for σ < max_{1≤j≤d} cj √(2/γℓmin) without requiring condition (12).

Proof. No false inclusion is shown from Theorem 2. Since w∗j − wNj = 0 for all j not in S,

\[
-\frac{1}{N} (w^*_S - w^N_S)^T X_S^T X_S\, (w^*_S - w^N_S)
\;-\; \frac{1}{N}\, \varepsilon^T X_S\, (w^*_S - w^N_S)
\;+\; \lambda_N\, P'(w^N_S;\, w^*_S - w^N_S) \;\ge\; 0, \tag{26}
\]

by the property of wN. Observe that

\[
\begin{aligned}
P'(w^N_S;\, w^*_S - w^N_S)
&= \sum_{j \in S} \Big[ c_j\, |\cdot|'(w^N_j;\, w^*_j - w^N_j) - \max_{k \in J^N_j} h'_{jk}(w^N_j)\, (w^*_j - w^N_j) \Big] \\
&= \sum_{j \in S,\; w^N_j \ne 0} \Big[ c_j\, \mathrm{sign}(w^N_j) - \max_{k \in J^N_j} h'_{jk}(w^N_j) \Big] (w^*_j - w^N_j)
\;+\; \sum_{j \in S,\; w^N_j = 0} c_j\, |w^*_j|
\end{aligned} \tag{27}
\]

since |·|′(0; ±1) = 1 and h′j(0) = 0 for all j, and hence

\[
P'(w^N_S;\, w^*_S - w^N_S) \;\le\; \max_{1 \le j \le d} c_j \sum_{j \in S} | w^*_j - w^N_j |
\]

since each value inside the bracket in (27) is between 0 and cj, inclusive, by assumption (As). Thus inequality (26) reduces to

\[
\frac{1}{N} (w^*_S - w^N_S)^T X_S^T X_S\, (w^*_S - w^N_S)
\;\le\; -\frac{1}{N}\, \varepsilon^T X_S\, (w^*_S - w^N_S)
\;+\; \lambda_N \max_{1 \le j \le d} c_j\, \| w^*_S - w^N_S \|_1.
\]

By applying Hölder's inequality and assumption (ARSC), we continue to have

\[
\begin{aligned}
\gamma^{\ell}_{\min}\, \| w^*_S - w^N_S \|_2^2
&\le \Big\| \frac{1}{N} X_S^T \varepsilon \Big\|_2\, \| w^*_S - w^N_S \|_2
+ \lambda_N \max_{1 \le j \le d} c_j\, \| w^*_S - w^N_S \|_1 \\
&\le \Big\{ \Big\| \frac{1}{N} X_S^T \varepsilon \Big\|_2
+ \lambda_N \max_{1 \le j \le d} c_j\, \sqrt{|S|} \Big\}\, \| w^*_S - w^N_S \|_2
\end{aligned}
\]

which yields

\[
\| w^*_S - w^N_S \|_2 \;\le\; \Big\| \frac{\sqrt{|S|}}{\gamma^{\ell}_{\min} N}\, X_S^T \varepsilon \Big\|_\infty
\;+\; \lambda_N \max_{1 \le j \le d} c_j\, \frac{\sqrt{|S|}}{\gamma^{\ell}_{\min}}
\]

by properties of the vector norm. To bound the first term on the right side of the inequality, consider

\[
Z_j \;\triangleq\; e_j^T\, \frac{\sqrt{|S|}}{\gamma^{\ell}_{\min} N}\, X_S^T \varepsilon
\]

for all j = 1, …, |S|. Since each εi ∼ N(0, σ²) for i = 1, …, N, the random variable Zj is zero-mean Gaussian with variance σ²√|S| / (γℓmin N). By applying the union bound and the Gaussian tail bound, we deduce

\[
\mathbb{P}\big( \| w^*_S - w^N_S \|_\infty > t \big)
\;\le\; 2 \exp\Big( -\frac{t^2\, \gamma^{\ell}_{\min}\, N}{2 \sigma^2 \sqrt{|S|}} + \log |S| \Big).
\]

This is due to the following derivation:

\[
\begin{aligned}
\mathbb{P}\big( t < \| w^*_S - w^N_S \|_\infty \big)
&\le \mathbb{P}\Big( t < \max_{1 \le j \le |S|} |Z_j| \Big) \\
&\le \mathbb{P}\big( ( t < |Z_1| ) \ \text{or} \ \ldots \ \text{or} \ ( t < |Z_{|S|}| ) \big) \\
&\le |S| \; \mathbb{P}\big( t < |Z_1| \big) && \text{by the union bound} \\
&\le |S| \; 2 \exp\Big( -\frac{t^2\, \gamma^{\ell}_{\min}\, N}{2 \sigma^2 \sqrt{|S|}} \Big) && \text{by the Gaussian tail bound} \\
&= 2 \exp\Big( -\frac{t^2\, \gamma^{\ell}_{\min}\, N}{2 \sigma^2 \sqrt{|S|}} + \log |S| \Big).
\end{aligned}
\]

Let t = σ λN √( 2|S| / γℓmin ); then, by the choice of λN, we have λ²N N √|S| > log |S|, assuming |S| ≥ 1. Thus, with probability at least 1 − c₁ e^{−c₂ λ²_N N}, we achieve the bound (25).

Acknowledgement: The author gratefully acknowledges Jong-Shi Pang for his involvement in fruitful discussions, and for providing valuable ideas that helped to build the foundation of this work.

References

1. M. Ahn. Difference-of-convex learning: optimization with non-convex sparsity functions. University of Southern California (2018).
2. M. Ahn, J.S. Pang and J. Xin. Difference of convex learning: directional stationarity, optimality and sparsity. SIAM Journal on Optimization 27(3): 1637–1665 (2017).
3. D.P. Bertsekas. Nonlinear Programming. Second Edition. Athena Scientific (Belmont 1999).
4. P. Buhlmann and S. van de Geer. Statistics for High-dimensional Data. Springer Series in Statistics (2011).
5. B.J. Bickel, Y. Ritov and A.B. Tsybakov. Simultaneous analysis of LASSO and Dantzig selector. The Annals of Statistics 37(4): 1705–1732 (2009).
6. E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory 51(12): 4203–4216 (2005).
7. E. Candes and T. Tao. Near optimal signal recovery from random projections: universal encoding strategies. IEEE Transactions on Information Theory 52(12): 5406–5425 (2006).
8. E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35(6): 2313–2351 (2007).
9. E. Candes, M. Wakin and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications 14(5): 877–905 (2008).
10. H. Dong, M. Ahn and J.S. Pang. Structural properties of affine sparsity constraints. Mathematical Programming Series B, http://doi.org/10.1007/s10107-018-1283-3 (2018).
11. J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456): 1348–1360 (2001).
12. J. Fan and J. Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory 57(8): 5467–5484 (2011).
13. T. Hastie, R. Tibshirani and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Taylor & Francis Group (Boca Raton 2015).
14. K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics 28(5): 1356–1378 (2000).
15. H.A. Le Thi and D.T. Pham. The DC programming and DCA revised with DC models of real world nonconvex optimization problems. Annals of Operations Research 133: 25–46 (2005).
16. H.A. Le Thi, D.T. Pham and X.T. Vo. DC approximation approaches for sparse optimization. European Journal of Operations Research 244: 26–46 (2015).
17. P. Loh and M. Wainwright. Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. Journal of Machine Learning Research 16: 559–616 (2015).
18. P. Loh and M. Wainwright. Support recovery without incoherence: a case for nonconvex regularization. Annals of Statistics 45(6): 2455–2482 (2017).
19. Y. Lou, P. Yin and J. Xin. Point source super-resolution via nonconvex L1 based methods. Journal of Scientific Computing 68(3): 1082–1100 (2016).
20. S. Lu, Y. Liu, L. Yin and K. Zhang. Confidence intervals and regions for the lasso by using stochastic variational inequality techniques in optimization. Journal of the Royal Statistical Society Series B 79(2): 589–611 (2017).
21. S. Negahban, P. Ravikumar, M. Wainwright and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27(4): 538–557 (2012).
22. M. Nikolova. Local strong homogeneity of a regularized estimator. SIAM Journal on Applied Mathematics 61(2): 633–658 (2000).
23. J.S. Pang, M. Razaviyayn and A. Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research 42(1): 95–118 (2017).
24. J.S. Pang and M. Tao. Decomposition methods for computing directional stationary solutions of a class of nonsmooth nonconvex optimization problems. SIAM Journal on Optimization 28(2): 1640–1669 (2018).
25. A. Shapiro, D. Dentcheva and A. Ruszczynski. Lectures on Stochastic Programming: Modeling and Theory. SIAM Publications (Philadelphia 2009).
26. R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society 58(1): 267–288 (1996).
27. P. Yin, Y. Lou, Q. He and J. Xin. Minimization of L1-L2 for compressed sensing. SIAM Journal on Scientific Computing 37(1): 536–563 (2015).
28. W. Yin, S. Osher, D. Goldfarb and J. Darbon. Bregman iterative algorithms for L1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences 1(1): 143–168 (2008).
29. C. Zhang. Nearly unbiased variable selection under Minimax Concave Penalty. The Annals of Statistics 38(2): 894–942 (2010).
30. S. Zhang and J. Xin. Minimization of transformed L1 penalty: theory, difference of convex function algorithm, and robust application in compressed sensing. Mathematical Programming, Series B 169(1): 307–336 (2018).
31. P. Zhao and B. Yu. On model selection consistency of LASSO. Journal of Machine Learning Research 7: 2541–2563 (2006).
32. H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101(476): 1418–1429 (2006).


Appendix

This section provides a literature review of one particular reference which presents a chapter on statistical inference analysis for the LASSO problem [13, Chapter 11]. We formally define the LASSO problem:
\[
\operatorname*{minimize}_{w \in \mathbb{R}^d} \; \frac{1}{2N} \| Y - X w \|_2^2 + \lambda^{\rm Lasso}_N \| w \|_1 \;\triangleq\; \theta^{\rm Lasso}_N(w) \tag{28}
\]
where each row of $X \in \mathbb{R}^{N \times d}$ and each component of $Y \in \mathbb{R}^{N \times 1}$ are the feature information sample $x^T$ and its outcome $y$, respectively. Being a convex program, any local minimizer of LASSO is a global optimum, as known from convex optimization theory. Exploiting this fact, the LASSO analysis provided in the reference compares the optimal solutions of the $\ell_1$-regularized sample average approximation problem (28), denoted by $w^{\rm Lasso}$, with the underlying ground truth vector $w^0$. The assumption on $w^0$ is that all attained samples are generated from this vector and then perturbed by random Gaussian noise, i.e., $Y = X w^0 + \varepsilon$ where each component $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ for $1 \le i \le N$. Moreover, the authors assume that the ground truth is a sparse vector with its nonzero components defining the support set $S_0 \triangleq \{\, i \mid w^0_i \ne 0 \ \text{for} \ 1 \le i \le d \,\}$.
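For concreteness, the following Python sketch minimizes the objective in (28) by proximal gradient descent (ISTA) on synthetic data; the solver, the data sizes, and the regularization level are illustrative choices and are not taken from the reference.
\begin{verbatim}
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, Y, lam, num_iters=5000):
    """Minimize (1/(2N)) ||Y - Xw||_2^2 + lam * ||w||_1 by proximal gradient."""
    N, d = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / N)   # reciprocal Lipschitz constant
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = -X.T @ (Y - X @ w) / N              # gradient of the least-squares term
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Synthetic example: sparse ground truth w0 plus Gaussian noise (placeholder sizes).
rng = np.random.default_rng(0)
N, d, sigma = 100, 20, 0.1
w0 = np.zeros(d); w0[:3] = [1.0, -2.0, 0.5]        # support S_0 = {0, 1, 2}
X = rng.normal(size=(N, d))
Y = X @ w0 + rng.normal(0.0, sigma, size=N)
w_lasso = lasso_ista(X, Y, lam=0.05)
print("estimated support:", np.nonzero(np.abs(w_lasso) > 1e-6)[0])
\end{verbatim}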

The topics the authors address in the referenced work are the following: how close the empirical solution is to the ground truth; if the ultimate goal of the problem is to make future predictions, whether it is possible to compare the empirical model outputs with the noise-free outputs produced by the ground truth on the available samples; and whether the empirical solution can recover the indices of the nonzero components that are contained in $S_0$. To answer these questions, the authors derive a region which contains the difference between any solution of (28) and $w^0$. They define
\[
\mathcal{V}^{\rm Lasso} \;\triangleq\; \big\{\, v \in \mathbb{R}^d \;\big|\; \| v_{S_0^c} \|_1 \le 3\, \| v_{S_0} \|_1 \,\big\},
\]

where $S_0^c$ is the complement of $S_0$. Therein, the authors assume that $\theta^{\rm Lasso}_N$ is strongly convex at the point $w^0$ with respect to $\mathcal{V}^{\rm Lasso}$, making a connection between the empirical solution and the ground truth by utilizing this region. Though (28) is a convex program, strong convexity of the entire objective function cannot be expected in general. This assumption guarantees that the submatrix of the Hessian, $X^T X$, corresponding to the indices in $S_0$ has full rank. The statement of the restricted eigenvalue assumption, analogous to restricted strong convexity for the special case of least squares error minimization for linear regression, is as follows: there exists a constant $\gamma^{\rm Lasso} > 0$ such that
\[
\frac{\frac{1}{N}\, v^T X^T X\, v}{\| v \|_2^2} \;\ge\; \gamma^{\rm Lasso} \quad \text{for all nonzero } v \in \mathcal{V}^{\rm Lasso}. \tag{29}
\]
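The restricted eigenvalue condition (29) quantifies over the whole cone $\mathcal{V}^{\rm Lasso}$ and therefore cannot be certified by sampling, but the restricted Rayleigh quotient is easy to evaluate at any candidate direction. The Python sketch below is a heuristic spot-check on a synthetic design matrix; the support $S_0$, the sizes, and the sampling scheme are assumptions made purely for illustration.
\begin{verbatim}
import numpy as np

def re_ratio(X, v):
    """Restricted Rayleigh quotient (1/N) v^T X^T X v / ||v||_2^2 from (29)."""
    N = X.shape[0]
    return (v @ (X.T @ (X @ v))) / (N * (v @ v))

def sample_cone_direction(rng, d, S0):
    """Draw a random v with ||v_{S0^c}||_1 <= 3 ||v_{S0}||_1 (membership in V_Lasso)."""
    v = rng.normal(size=d)
    Sc = np.setdiff1d(np.arange(d), S0)
    budget = 3.0 * np.sum(np.abs(v[S0]))
    off = np.sum(np.abs(v[Sc]))
    if off > budget:                      # shrink the off-support part into the cone
        v[Sc] *= budget / off
    return v

rng = np.random.default_rng(0)
N, d = 100, 20
S0 = np.array([0, 1, 2])                  # assumed support of the ground truth
X = rng.normal(size=(N, d))
ratios = [re_ratio(X, sample_cone_direction(rng, d, S0)) for _ in range(10_000)]
print("smallest sampled ratio (upper estimate of gamma_Lasso):", min(ratios))
\end{verbatim}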

We list the results provided in the reference: a basic consistency bound, a bound on the prediction error, and the support recovery of $w^{\rm Lasso}$. The assumptions imposed for each theorem and the key ideas for the proofs will be discussed.


• Consistency result [13, Theorem 11.1]: Suppose the model matrix $X$ satisfies the restricted eigenvalue bound (29) with respect to the set $\mathcal{V}^{\rm Lasso}$. Given a regularization parameter $\lambda^{\rm Lasso}_N \ge \frac{2}{N} \| X^T \varepsilon \|_\infty > 0$, any solution $w^{\rm Lasso}$ of (28) satisfies the bound
\[
\| w^{\rm Lasso} - w^0 \|_2 \;\le\; \frac{3}{\gamma^{\rm Lasso}} \sqrt{\frac{|S_0|}{N}}\; \sqrt{N}\; \lambda^{\rm Lasso}_N. \tag{30}
\]

Exploiting the fact that $w^{\rm Lasso}$ is a global minimizer of the LASSO problem, the proof of the theorem starts from $\theta^{\rm Lasso}_N(w^{\rm Lasso}) \le \theta^{\rm Lasso}_N(w^0)$. We substitute the assumption on the ground truth, $Y = X w^0 + \varepsilon$, into both sides of the inequality, then apply the assumption on the regularization parameter $\lambda^{\rm Lasso}_N$. These steps yield a key inequality given by
\[
\frac{\| X (w^{\rm Lasso} - w^0) \|_2^2}{2N} \;\le\; \frac{3}{2} \sqrt{|S_0|}\; \lambda^{\rm Lasso}_N\; \| w^{\rm Lasso} - w^0 \|_2, \tag{31}
\]
which serves as a building block to derive the current theorem and the prediction error bound to be shown. It can be verified that, by letting $v = w^{\rm Lasso} - w^0$, the proof is complete provided that the restricted eigenvalue condition holds; the last step requires a lemma which shows that any error $w^{\rm Lasso} - w^0$ associated with the LASSO solution $w^{\rm Lasso}$ belongs to the set $\mathcal{V}^{\rm Lasso}$ if the condition on $\lambda^{\rm Lasso}_N$ holds.
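The key inequality (31) and the resulting bound (30) can be checked numerically on a synthetic instance. In the Python sketch below the regularization parameter is set to the theorem's lower threshold $\frac{2}{N}\|X^T\varepsilon\|_\infty$, and $\gamma^{\rm Lasso}$ is replaced by the restricted quotient evaluated at the actual error direction; this is only a plausibility check under assumed data, not a verification of the theorem.
\begin{verbatim}
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, Y, lam, num_iters=20000):
    N, d = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / N)
    w = np.zeros(d)
    for _ in range(num_iters):
        w = soft_threshold(w + step * X.T @ (Y - X @ w) / N, step * lam)
    return w

rng = np.random.default_rng(1)
N, d, sigma = 200, 30, 0.5
w0 = np.zeros(d); w0[:4] = [2.0, -1.5, 1.0, -0.5]     # |S_0| = 4
X = rng.normal(size=(N, d))
eps = rng.normal(0.0, sigma, size=N)
Y = X @ w0 + eps

lam = 2.0 * np.linalg.norm(X.T @ eps, np.inf) / N     # threshold value of lambda_N
w_hat = lasso_ista(X, Y, lam)
v = w_hat - w0
s0 = np.count_nonzero(w0)

lhs31 = np.linalg.norm(X @ v) ** 2 / (2 * N)          # left side of (31)
rhs31 = 1.5 * np.sqrt(s0) * lam * np.linalg.norm(v)   # right side of (31)
gamma = (v @ X.T @ X @ v) / (N * (v @ v))             # restricted quotient at the error
rhs30 = 3.0 * np.sqrt(s0) * lam / gamma               # (30) simplifies to 3 sqrt(|S_0|) lam / gamma
print(f"(31): {lhs31:.4f} <= {rhs31:.4f}")
print(f"(30): ||w_hat - w0||_2 = {np.linalg.norm(v):.4f} <= {rhs30:.4f}")
\end{verbatim}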

• Bounds on the prediction error [13, Theorem 11.2]: Suppose the matrix $X$ satisfies the restricted eigenvalue condition (29) over the set $\mathcal{V}^{\rm Lasso}$. Given a regularization parameter $\lambda^{\rm Lasso}_N \ge \frac{2}{N} \| X^T \varepsilon \|_\infty > 0$, any solution $w^{\rm Lasso}$ of (28) satisfies the bound
\[
\frac{\| X (w^{\rm Lasso} - w^0) \|_2^2}{N} \;\le\; \frac{9}{\gamma^{\rm Lasso}}\; |S_0| \, (\lambda^{\rm Lasso}_N)^2. \tag{32}
\]
The proof for the prediction error bound is straightforward; it can be shown by combining the restricted eigenvalue assumption (29) and the inequality (31), which is derived in the process of proving the consistency result.

• Assumptions for the variable selection consistency result: To address variable selection consistency of the LASSO solution $w^{\rm Lasso}$, the authors provide a distinct set of assumptions which are related to the structure of the matrix $X$. The mutual incoherence (sometimes also referred to as irrepresentability) condition states that there must exist some $\gamma^{\rm Lasso} > 0$ such that
\[
\max_{j \in S_0^c} \| (X_{S_0}^T X_{S_0})^{-1} X_{S_0}^T x_j \|_1 \;\le\; 1 - \gamma^{\rm Lasso}. \tag{33}
\]
The authors point out that, in the most desirable case, any $j$th column $x_j$ where $j$ belongs to the set of indices of the zero components of $w^0$ would be orthogonal to the columns of $X_{S_0} \in \mathbb{R}^{N \times |S_0|}$, the submatrix of $X$ that consists of the columns corresponding to $S_0$. As such orthogonality is not attainable for high-dimensional linear regression, the assumption ensures that `near orthogonality' holds for the design matrix. In addition, they assume
\[
\max_{1 \le j \le d} \frac{1}{\sqrt{N}}\, \| x_j \|_2 \;\le\; K_{\rm clm} \tag{34}
\]


for some $K_{\rm clm} > 0$, which can be interpreted as the matrix $X$ having normalized columns. For example, the matrix can be normalized such that $\| x_j \|_2$ equals $\sqrt{N}$ for every $j$, resulting in the constant $K_{\rm clm}$ taking the value $1$. The last assumption made on the matrix $X$ is
\[
\lambda_{\min}\!\left( \frac{X_{S_0}^T X_{S_0}}{N} \right) \;\ge\; C_{\min} \tag{35}
\]
for some positive constant $C_{\min}$, where $\lambda_{\min}$ denotes the minimum eigenvalue of the given matrix. The authors note that if this condition is violated then the columns of $X_{S_0}$ are linearly dependent, and it is not possible to recover $w^0$ even if its supporting indices are known.
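All three design-matrix conditions are directly computable for a given $X$ and candidate support $S_0$. The following Python sketch evaluates the incoherence quantity in (33), the column constant $K_{\rm clm}$ in (34), and the minimum eigenvalue in (35) on a synthetic Gaussian design; the sizes and the support are placeholders.
\begin{verbatim}
import numpy as np

def check_design_conditions(X, S0):
    """Evaluate the quantities appearing in (33), (34), and (35) for a support S0."""
    N, d = X.shape
    Sc = np.setdiff1d(np.arange(d), S0)
    XS = X[:, S0]
    G_inv = np.linalg.inv(XS.T @ XS)                       # (X_{S0}^T X_{S0})^{-1}

    # (33): mutual incoherence, max_j ||(X_S^T X_S)^{-1} X_S^T x_j||_1 over j in S0^c
    incoherence = max(np.linalg.norm(G_inv @ XS.T @ X[:, j], 1) for j in Sc)
    # (34): column normalization constant K_clm
    K_clm = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(N)
    # (35): minimum eigenvalue of X_S^T X_S / N
    C_min = np.linalg.eigvalsh(XS.T @ XS / N).min()
    return incoherence, K_clm, C_min

rng = np.random.default_rng(0)
N, d = 200, 30
S0 = np.array([0, 1, 2, 3])
X = rng.normal(size=(N, d))
inc, K, C = check_design_conditions(X, S0)
print(f"(33) incoherence = {inc:.3f}  (needs <= 1 - gamma_Lasso for some gamma_Lasso > 0)")
print(f"(34) K_clm       = {K:.3f}")
print(f"(35) min eigenvalue = {C:.3f}  (needs to be bounded away from 0)")
\end{verbatim}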

• Variable selection consistency [13, Theorem 11.3]: Suppose the matrix $X$ satisfies the mutual incoherence condition (33) with parameter $\gamma^{\rm Lasso} > 0$, the column normalization condition (34) and the eigenvalue condition (35). For a noise vector $\varepsilon \in \mathbb{R}^N$ with i.i.d. $\mathcal{N}(0, \sigma^2)$ entries, consider the LASSO problem (28) with a regularization parameter
\[
\lambda_N \;\ge\; \frac{8 K_{\rm clm}\, \sigma}{\gamma^{\rm Lasso}} \sqrt{\frac{\log d}{N}}.
\]
Then with probability greater than $1 - c_1 e^{-c_2 N \lambda_N^2}$, the LASSO has the following properties:

1. Uniqueness: the optimal solution $w^{\rm Lasso}$ is unique;
2. No false inclusion: the unique optimal solution has its support contained within $S_0$, i.e., $\mathrm{support}(w^{\rm Lasso}) \subseteq \mathrm{support}(w^0)$;
3. $\ell_\infty$-bound: the error $w^{\rm Lasso} - w^0$ satisfies the $\ell_\infty$ bound
\[
\| w^{\rm Lasso}_{S_0} - w^0_{S_0} \|_\infty \;\le\; \underbrace{\lambda_N \left[ \frac{4 \sigma}{\sqrt{C_{\min}}} + \| (X_{S_0}^T X_{S_0} / N)^{-1} \|_\infty \right]}_{B(\lambda_N, \sigma;\, X)}
\]
where $\| A \|_\infty$ for a matrix $A$ is defined as $\max_{\| u \|_\infty = 1} \| A u \|_\infty$;
4. No false exclusion: the nonzero components of the LASSO solution $w^{\rm Lasso}$ include all indices $j \in S_0$ such that $| w^0_j | > B(\lambda_N, \sigma; X)$, and hence the LASSO is variable selection consistent as long as $\min_{j \in S_0} | w^0_j | > B(\lambda_N, \sigma; X)$.
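The regularization level required by the theorem and the threshold $B(\lambda_N, \sigma; X)$ appearing in parts 3 and 4 are computable once $\sigma$ and the design are fixed. The Python sketch below evaluates them on a synthetic instance and reports whether the $\beta$-min type condition $\min_{j \in S_0} |w^0_j| > B(\lambda_N, \sigma; X)$ holds; the constants $c_1, c_2$ are unspecified, so the probability statement itself is not checked, and all data are placeholders.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 200, 30, 0.1
S0 = np.array([0, 1, 2, 3])
w0 = np.zeros(d); w0[S0] = [2.0, -1.5, 1.0, -0.8]
X = rng.normal(size=(N, d))

XS = X[:, S0]
K_clm = np.max(np.linalg.norm(X, axis=0)) / np.sqrt(N)          # constant in (34)
C_min = np.linalg.eigvalsh(XS.T @ XS / N).min()                 # constant in (35)
Sc = np.setdiff1d(np.arange(d), S0)
G_inv = np.linalg.inv(XS.T @ XS)
gamma = 1.0 - max(np.linalg.norm(G_inv @ XS.T @ X[:, j], 1) for j in Sc)  # slack in (33)

# Regularization level from the theorem and the resulting threshold B(lambda_N, sigma; X).
lam_N = 8.0 * K_clm * sigma / gamma * np.sqrt(np.log(d) / N)
op_inf = np.max(np.sum(np.abs(np.linalg.inv(XS.T @ XS / N)), axis=1))  # induced inf-norm
B = lam_N * (4.0 * sigma / np.sqrt(C_min) + op_inf)

print(f"lambda_N = {lam_N:.4f},  B(lambda_N, sigma; X) = {B:.4f}")
print("beta-min condition min_j |w0_j| > B holds:", np.min(np.abs(w0[S0])) > B)
\end{verbatim}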

Showing the uniqueness involves solving a hypothetical problem; the authors set $w^{\rm Lasso}_{S_0^c} = 0$ and solve a reduced-size problem in which the objective function of LASSO is minimized with respect to $w_{S_0} \in \mathbb{R}^{|S_0|}$. By properties of convexity and the first-order optimality condition (referred to as the zero-subgradient condition in the reference), the authors show that all optimal solutions of the original LASSO are supported only on $S_0$, thus the solutions can be obtained by solving the reduced problem. The lower eigenvalue condition (35) is then used to show uniqueness.

By the first-order optimality condition for convex non-differentiable problems, there exists a subgradient of $\| \bullet \|_1$, denoted by $z$, such that $-\frac{1}{N} X^T ( Y - X w^{\rm Lasso} ) + \lambda_N z = 0$. This equation can be rewritten in a block-matrix form by substituting the definition of $Y$:

\[
\frac{1}{N}
\begin{bmatrix}
X_{S_0}^T X_{S_0} & X_{S_0}^T X_{S_0^c} \\[3pt]
X_{S_0^c}^T X_{S_0} & X_{S_0^c}^T X_{S_0^c}
\end{bmatrix}
\begin{bmatrix}
w^{\rm Lasso}_{S_0} - w^0_{S_0} \\[3pt]
0
\end{bmatrix}
\;-\;
\frac{1}{N}
\begin{bmatrix}
X_{S_0}^T \varepsilon \\[3pt]
X_{S_0^c}^T \varepsilon
\end{bmatrix}
\;+\;
\lambda_N
\begin{bmatrix}
z_{S_0} \\[3pt]
z_{S_0^c}
\end{bmatrix}
\;=\;
\begin{bmatrix}
0 \\[3pt]
0
\end{bmatrix}.
\]

This is the key equation which is used to show the remaining parts of the theorem. By applying the assumptions, the authors investigate the quantity $w^{\rm Lasso}_{S_0} - w^0_{S_0}$ by examining the above equation. Due to the presence of the error vector $\varepsilon$, probability is introduced in the statement; the error is zero-mean Gaussian random noise, hence the authors apply related probabilistic bounds to achieve the third part of the theorem.