
Robust estimation via (non)convex M-estimation

Po-Ling Loh

University of Wisconsin–Madison, Departments of ECE & Statistics

Workshop on Computational Efficiency and High-Dimensional Robust Statistics

TTI Chicago

August 15, 2018


Outline

1 Regularized M-estimators
   Statistical M-estimation
   Nonconvexity
   Consistency of local optima

2 High-dimensional robust regression
   Statistical consistency
   Asymptotic normality
   Two-step M-estimators


Regularized M-estimators

Prediction/regression problem: Observe $\{(x_i, y_i)\}_{i=1}^n$ and estimate

\[
\beta^* = \arg\min_{\beta \in \mathbb{R}^p} \mathbb{E}[\ell(\beta; x_i, y_i)], \qquad x_i \in \mathbb{R}^p, \; y_i \in \mathbb{R}.
\]

Statistical M-estimator:

\[
\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(\beta; x_i, y_i) \Big\}
\]

In high dimensions, this may be ill-conditioned, with a large solution space.

Regularized M-estimator:

\[
\hat{\beta} \in \arg\min_{\beta} \Big\{ \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(\beta; x_i, y_i)}_{\mathcal{L}_n(\beta)} + \rho_\lambda(\beta) \Big\}
\]

Example: $\ell_1$-regularized OLS regression

Linear model: $y_i = x_i^T \beta^* + \epsilon_i$, with $\|\beta^*\|_0 \le k$.

Low-dimensional M-estimator:

\[
\hat{\beta}_{\mathrm{OLS}} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \Big\}
= \Big( \frac{1}{n} \sum_{i=1}^n x_i x_i^T \Big)^{-1} \Big( \frac{1}{n} \sum_{i=1}^n y_i x_i \Big)
\]

High-dimensional regularized M-estimator:

\[
\hat{\beta}_{\mathrm{Lasso}} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \|\beta\|_1 \Big\}
\]

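As a concrete companion to the two estimators above, here is a minimal NumPy sketch (not from the talk) that computes the closed-form OLS solution and fits the Lasso by proximal gradient descent (ISTA); the toy data, step size, and λ value are illustrative assumptions only.

```python
import numpy as np

def ols(X, y):
    # Closed-form OLS: (X^T X / n)^{-1} (X^T y / n); only sensible when n > p and X^T X is invertible
    n = X.shape[0]
    return np.linalg.solve(X.T @ X / n, X.T @ y / n)

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # Proximal gradient (ISTA) on (1/n) ||y - X beta||_2^2 + lam * ||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy sparse linear model with p > n
rng = np.random.default_rng(0)
n, p, k = 100, 200, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = lasso_ista(X, y, lam=0.1)
print(np.linalg.norm(beta_hat - beta_star))
```

The soft-thresholding step is the proximal operator of $\lambda\|\cdot\|_1$, which is what keeps the nondifferentiable $\ell_1$-regularized problem tractable.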

Sources of nonconvexity

May arise in loss or regularizer

Nonconvex loss used to correct bias, increase efficiency

Nonconvex regularizer used to reduce bias, achieve oracle result


Example: Errors-in-variables regression

Model:
\[
y_i = x_i^T \beta^* + \epsilon_i,
\]
but we only observe corrupted covariates: given $\{(z_i, y_i)\}_{i=1}^n$, infer $\beta^*$.

Naive estimator (OLS loss with $\ell_1$ penalty, using the observed $z_i$):

\[
\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n (z_i^T \beta - y_i)^2 + \lambda \|\beta\|_1 \Big\}
\]

This is statistically inconsistent.


Corrected losses

L. & Wainwright '12 propose a natural method for correcting the loss for linear regression:

\[
\hat{\beta}_{\mathrm{OLS}} \in \arg\min_{\beta} \Big\{ \frac{1}{2} \beta^T \frac{X^T X}{n} \beta - \frac{y^T X}{n} \beta + \rho_\lambda(\beta) \Big\}
\]

\[
\hat{\beta}_{\mathrm{corr}} \in \arg\min_{\beta} \Big\{ \frac{1}{2} \beta^T \hat{\Gamma} \beta - \hat{\gamma}^T \beta + \rho_\lambda(\beta) \Big\}
\]

where $(\hat{\Gamma}, \hat{\gamma})$ are estimators of $(\mathrm{Cov}(x_i), \mathrm{Cov}(x_i, y_i))$ based on $\{(z_i, y_i)\}_{i=1}^n$.


Example: Additive noise

Additive noise: $Z = X + W$; use

\[
\hat{\Gamma} = \frac{Z^T Z}{n} - \Sigma_w, \qquad \hat{\gamma} = \frac{Z^T y}{n}.
\]

However, the corrected objective is nonconvex:

\[
\hat{\beta}_{\mathrm{corr}} \in \arg\min_{\beta} \Big\{ \frac{1}{2} \beta^T \Big( \frac{Z^T Z}{n} - \Sigma_w \Big) \beta - \frac{y^T Z}{n} \beta + \rho_\lambda(\beta) \Big\}
\]

Fortunately, local optima have good properties.

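A minimal NumPy sketch of the additive-noise correction above; the noise level $\sigma_w$, the dimensions, and the $\ell_1$ choice for $\rho_\lambda$ are illustrative assumptions, not values from the talk. It also checks that $\hat{\Gamma}$ is indefinite when $p > n$, which is exactly the nonconvexity the slide refers to.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 80, 200, 5
sigma_w = 0.5                      # assumed known additive-noise level on the covariates

# Sparse linear model with noisily observed covariates Z = X + W
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)
Z = X + sigma_w * rng.standard_normal((n, p))
Sigma_w = sigma_w**2 * np.eye(p)

# Corrected surrogates for Cov(x_i) and Cov(x_i, y_i)
Gamma_hat = Z.T @ Z / n - Sigma_w
gamma_hat = Z.T @ y / n

def corrected_loss(beta, lam):
    # (1/2) beta^T Gamma_hat beta - gamma_hat^T beta + lam * ||beta||_1  (ell_1 regularizer for illustration)
    return 0.5 * beta @ Gamma_hat @ beta - gamma_hat @ beta + lam * np.abs(beta).sum()

# With p > n, Gamma_hat has negative eigenvalues, so the corrected objective is nonconvex
print("smallest eigenvalue of Gamma_hat:", np.linalg.eigvalsh(Gamma_hat).min())
print("corrected loss at beta*:", corrected_loss(beta_star, lam=0.1))
```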

Nonconvex regularizers

$\ell_1$ is a "convexified" version of $\ell_0$.

[Figure: the $\ell_0$ penalty (a jump from 0 to 1 away from the origin) and its convex relaxation, the $\ell_1$ penalty.]

But $\ell_1$ penalizes larger coefficients more, causing solution bias.


Alternative regularizers

Various nonconvex regularizers appear in the literature (Fan & Li '01, Zhang '10, etc.).

[Figure: penalty functions $\rho_\lambda(t)$ for SCAD, MCP, $\ell_q$, capped-$\ell_1$, and $\ell_1$.]


Empirical benefits

Nonconvex regularizers show significant improvement (Breheny & Huang '11).


Local vs. global optima

Optimization algorithms are only guaranteed to find local optima (stationary points).

Statistical theory only guarantees consistency of global optima.

[Figure: the global optimum $\hat{\beta}$ and any stationary point $\tilde{\beta}$ both lie within $O\big(\sqrt{k \log p / n}\big)$ of $\beta^*$.]

L. & Wainwright '13: All stationary points of $\mathcal{L}_n(\beta) + \rho_\lambda(\beta)$ are close when the nonconvexity is smaller than the curvature.


Measures of closeness

Various measures of statistical consistency for

\[
\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(\beta; x_i, y_i) + \rho_\lambda(\beta) \Big\}
\]

Estimation: $\|\hat{\beta} - \beta^*\| \to 0$

Prediction: $\frac{1}{n} \sum_{i=1}^n \ell(\hat{\beta}; x_i, y_i) \to 0$

Variable selection: $\mathrm{supp}(\hat{\beta}) \to \mathrm{supp}(\beta^*)$

Interested in cases where $\ell$ and $\rho_\lambda$ are possibly nonconvex.


Estimation/prediction consistency

Composite objective function:

\[
\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \Big\{ \mathcal{L}_n(\beta) + \sum_{j=1}^p \rho_\lambda(\beta_j) \Big\}
\]

$\mathcal{L}_n$ satisfies restricted strong convexity with curvature $\alpha$

$\rho_\lambda$ has bounded subgradient at 0, and $\rho_\lambda(t) + \mu t^2$ is convex

L. & Wainwright '13: All stationary points of $\mathcal{L}_n(\beta) + \rho_\lambda(\beta)$ are close when $\alpha > \mu$.


Geometric intuition

Population-level convexity, finite-sample nonconvexity.

[Figure: function view of $\mathcal{L}_n$ and gradient view of $\nabla \mathcal{L}_n$ near $\beta^*$.]

Population-level objective $\mathcal{L}$ is strongly convex, with $\alpha > \mu$.

RSC quantifies the convergence rate of $\nabla \mathcal{L}_n \to \nabla \mathcal{L}$.


More formally

[Figure: any stationary point $\tilde{\beta}$, like the global optimum $\hat{\beta}$, lies within $O\big(\sqrt{k \log p / n}\big)$ of $\beta^*$.]

Stationary points are statistically indistinguishable from global optima:

\[
\langle \nabla \mathcal{L}_n(\tilde{\beta}) + \nabla \rho_\lambda(\tilde{\beta}), \, \beta - \tilde{\beta} \rangle \ge 0, \qquad \forall \beta \text{ feasible.}
\]

Nonasymptotic rates: For $\lambda \asymp \sqrt{\frac{\log p}{n}}$ and $R \asymp \frac{1}{\lambda}$,

\[
\|\tilde{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}} \approx \text{statistical error.}
\]


Technical conditions

Requirements on the loss and regularizer to ensure consistency of stationary points:

Restricted strong convexity of $\mathcal{L}_n$

Bound on the nonconvexity of $\rho_\lambda$

"Oracle" result under an additional condition on $\rho_\lambda$


Conditions on $\mathcal{L}_n$

Restricted strong convexity (Negahban et al. '12):

\[
\langle \nabla \mathcal{L}_n(\beta^* + \Delta) - \nabla \mathcal{L}_n(\beta^*), \, \Delta \rangle \ge
\begin{cases}
\alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, & \forall \|\Delta\|_2 \le r, \\[4pt]
\alpha \|\Delta\|_2 - \tau \sqrt{\frac{\log p}{n}} \|\Delta\|_1, & \text{otherwise.}
\end{cases}
\]

How is such a result possible?

[Figure: a saddle-shaped loss surface with directions of both positive and negative curvature.]

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer.

Holds for various convex/nonconvex losses: OLS and corrected OLS for linear regression, the log likelihood for GLMs, the Huber loss for robust regression.

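To make the RSC picture concrete, here is a small numerical illustration (an assumption-laden sketch, not part of the talk) using the corrected errors-in-variables loss from earlier: for that quadratic loss, the curvature along $\Delta$ is exactly $\Delta^T \hat{\Gamma} \Delta$, which is negative along some directions when $p > n$ but typically of order $\|\Delta\|_2^2$ along $k$-sparse directions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 80, 200, 5
sigma_w = 0.5

# Corrected errors-in-variables Hessian from the earlier example: Gamma_hat = Z'Z/n - Sigma_w
X = rng.standard_normal((n, p))
Z = X + sigma_w * rng.standard_normal((n, p))
Gamma_hat = Z.T @ Z / n - sigma_w**2 * np.eye(p)

def curvature(delta):
    # <grad L_n(beta* + delta) - grad L_n(beta*), delta> = delta' Gamma_hat delta for the quadratic corrected loss
    return delta @ Gamma_hat @ delta

# Direction of most negative curvature: the loss is globally nonconvex
eigvals, eigvecs = np.linalg.eigh(Gamma_hat)
print("most negative curvature:", curvature(eigvecs[:, 0]))        # strictly negative

# Curvature along random k-sparse unit directions: typically bounded well away from zero
for _ in range(3):
    delta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)
    delta[support] = rng.standard_normal(k)
    delta /= np.linalg.norm(delta)
    print("sparse-direction curvature:", curvature(delta))          # close to ||delta||_2^2 = 1
```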

Conditions on $\rho_\lambda$

Focus on amenable regularizers $\rho_\lambda(\beta) = \sum_{j=1}^p \rho_\lambda(\beta_j)$ satisfying:

$\rho_\lambda(0) = 0$, symmetric around 0

Nondecreasing on $\mathbb{R}^+$

$t \mapsto \frac{\rho_\lambda(t)}{t}$ nonincreasing on $\mathbb{R}^+$

$q_\lambda(t) := \lambda|t| - \rho_\lambda(t)$ differentiable everywhere

$\rho_\lambda(t) + \mu t^2$ convex for some $\mu > 0$

$\rho_\lambda'(t) = 0$ for $t \ge \gamma\lambda$, for some $\gamma > 0$

Examples:

$\mu$-amenable: $\ell_1$, SCAD, MCP, LSP

$(\mu, \gamma)$-amenable: SCAD, MCP

Neither: capped-$\ell_1$, bridge penalty ($\ell_q$, $0 < q < 1$)

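For reference, a minimal NumPy sketch of two of the amenable penalties listed above, SCAD (Fan & Li '01) and MCP (Zhang '10); the default values of $\gamma$ are conventional choices, not prescribed by the slides. Both penalties are flat beyond $|t| = \gamma\lambda$, which is the $(\mu, \gamma)$-amenability property used later for oracle results.

```python
import numpy as np

def mcp(t, lam, gamma=3.0):
    # Minimax concave penalty:
    #   rho(t) = lam*|t| - t^2/(2*gamma)   for |t| <= gamma*lam
    #          = gamma*lam^2/2             for |t| >  gamma*lam
    a = np.abs(t)
    return np.where(a <= gamma * lam,
                    lam * a - a**2 / (2 * gamma),
                    0.5 * gamma * lam**2)

def scad(t, lam, gamma=3.7):
    # SCAD penalty:
    #   rho(t) = lam*|t|                                       for |t| <= lam
    #          = (2*gamma*lam*|t| - t^2 - lam^2)/(2*(gamma-1))  for lam < |t| <= gamma*lam
    #          = lam^2*(gamma+1)/2                              for |t| >  gamma*lam
    a = np.abs(t)
    mid = (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1))
    return np.where(a <= lam, lam * a,
                    np.where(a <= gamma * lam, mid, 0.5 * lam**2 * (gamma + 1)))

t = np.linspace(-4, 4, 9)
print(mcp(t, lam=1.0))    # flat (zero derivative) beyond |t| = gamma*lam
print(scad(t, lam=1.0))
```

The smooth part of MCP has second derivative $-1/\gamma$, so adding a quadratic of curvature at least $1/\gamma$ restores convexity; this is the sense in which these penalties are weakly convex ($\mu$-amenable) rather than convex.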


Statistical consistency

Regularized M-estimator

\[
\hat{\beta} \in \arg\min_{\|\beta\|_1 \le R} \big\{ \mathcal{L}_n(\beta) + \rho_\lambda(\beta) \big\},
\]

where the loss function satisfies $(\alpha, \tau)$-RSC and the regularizer is $\mu$-amenable.

Theorem (L. & Wainwright '13)

Suppose $R$ is chosen such that $\beta^*$ is feasible, and $\lambda$ satisfies

\[
\max\Big\{ \|\nabla \mathcal{L}_n(\beta^*)\|_\infty, \; \alpha \sqrt{\frac{\log p}{n}} \Big\} \;\lesssim\; \lambda \;\lesssim\; \frac{\alpha}{R}.
\]

For $n \ge \frac{C \tau^2}{\alpha^2} R^2 \log p$, any stationary point $\tilde{\beta}$ satisfies

\[
\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}, \qquad \text{where } k = \|\beta^*\|_0.
\]

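To connect this bound to the rate quoted earlier, plugging in the prescribed order of $\lambda$ gives (a one-line calculation, implicit in the slides):

\[
\lambda \asymp \sqrt{\frac{\log p}{n}} \quad \Longrightarrow \quad
\|\tilde{\beta} - \beta^*\|_2 \;\lesssim\; \frac{\lambda\sqrt{k}}{\alpha - \mu} \;\asymp\; \frac{1}{\alpha - \mu}\sqrt{\frac{k \log p}{n}},
\]

matching the $O\big(\sqrt{k \log p / n}\big)$ statistical error from the "More formally" slide.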


Outline

2 High-dimensional robust regression
   Statistical consistency
   Asymptotic normality
   Two-step M-estimators


Robust regression functions

Linear model:
\[
y_i = x_i^T \beta^* + \epsilon_i
\]

Heavy-tailed distribution on $\epsilon_i$ and/or outlier contamination.

Use the M-estimator

\[
\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) \Big\}
\]


Classes of loss functions

Bounded $\ell'$ limits the influence of outliers:

\[
IF((x, y); T, F) = \lim_{t \to 0^+} \frac{T((1 - t)F + t\delta_{(x,y)}) - T(F)}{t} \;\propto\; \ell'(x^T \beta - y)\, x,
\]

where $F \sim F_\beta$ and $T$ is the minimizer defining the M-estimator.

Redescending M-estimators have a finite rejection point:

\[
\ell'(u) = 0, \quad \text{for } |u| \ge c.
\]

[Figure (from Patrick Breheny, BST 764: Applied Statistical Modeling): loss versus residual for the least squares, absolute value, Huber, and Tukey losses.]

But bad for optimization!

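As a concrete illustration of the bounded versus redescending distinction, here is a short NumPy sketch of the Huber and Tukey bisquare losses and their $\psi$-functions; the tuning constants 1.345 and 4.685 are conventional defaults from the robustness literature (roughly 95% Gaussian efficiency), not values specified in the talk.

```python
import numpy as np

def huber_loss(u, c=1.345):
    # Huber loss: quadratic near zero, linear in the tails (convex, monotone bounded psi)
    a = np.abs(u)
    return np.where(a <= c, 0.5 * u**2, c * a - 0.5 * c**2)

def huber_psi(u, c=1.345):
    # psi = derivative of the Huber loss; bounded by c in absolute value
    return np.clip(u, -c, c)

def tukey_loss(u, c=4.685):
    # Tukey's bisquare loss: constant beyond |u| = c (redescending, nonconvex)
    a = np.minimum(np.abs(u) / c, 1.0)
    return (c**2 / 6) * (1 - (1 - a**2)**3)

def tukey_psi(u, c=4.685):
    # psi vanishes for |u| >= c: a finite rejection point
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)

u = np.linspace(-8, 8, 9)
print(huber_psi(u))   # saturates at +/- c
print(tukey_psi(u))   # redescends to exactly 0 for large residuals
```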

High-dimensional M-estimators

Natural idea: For $p > n$, use a regularized version:

\[
\hat{\beta} \in \arg\min_{\beta} \Big\{ \frac{1}{n} \sum_{i=1}^n \ell(x_i^T \beta - y_i) + \lambda \|\beta\|_1 \Big\}
\]

Complications:

Optimization for nonconvex $\ell$?

Statistical theory? Are certain losses provably better than others?


Overview of results

When $\|\ell'\|_\infty < C$, global optima of the high-dimensional M-estimator satisfy

\[
\|\hat{\beta} - \beta^*\|_2 \le C \sqrt{\frac{k \log p}{n}},
\]

regardless of the distribution of $\epsilon_i$.

Compare to Lasso theory: requires sub-Gaussian $\epsilon_i$'s.

If $\ell(u)$ is locally convex/smooth for $|u| \le r$, any local optima within radius $cr$ of $\beta^*$ satisfy

\[
\|\tilde{\beta} - \beta^*\|_2 \le C' \sqrt{\frac{k \log p}{n}}.
\]

Local optima may be obtained via a two-step algorithm.


Theoretical insight

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):

\[
\hat{\beta} \in \arg\min_{\beta} \Big\{ \underbrace{\frac{1}{n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1}_{\mathcal{L}_n(\beta)} \Big\}
\]

Rearranging the basic inequality $\mathcal{L}_n(\hat{\beta}) \le \mathcal{L}_n(\beta^*)$ and assuming $\lambda \ge 2 \big\| \frac{X^T \epsilon}{n} \big\|_\infty$, obtain

\[
\|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}.
\]

Sub-Gaussian assumptions on the $x_i$'s and $\epsilon_i$'s provide $O\big(\sqrt{\frac{k \log p}{n}}\big)$ bounds, which are minimax optimal.

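For completeness, here is the chain of inequalities behind the "basic inequality" step, written schematically (the cone condition needed before restricted strong convexity can be invoked on the left-hand side is suppressed):

\begin{align*}
\mathcal{L}_n(\hat{\beta}) \le \mathcal{L}_n(\beta^*)
\;\Longrightarrow\;
\frac{1}{n}\|X\hat{\Delta}\|_2^2
&\le \frac{2}{n}\,\epsilon^T X \hat{\Delta} + \lambda\big(\|\beta^*\|_1 - \|\hat{\beta}\|_1\big), \qquad \hat{\Delta} := \hat{\beta} - \beta^*, \\
&\le 2\Big\|\frac{X^T \epsilon}{n}\Big\|_\infty \|\hat{\Delta}\|_1
+ \lambda\big(\|\hat{\Delta}_S\|_1 - \|\hat{\Delta}_{S^c}\|_1\big)
\;\le\; 2\lambda \|\hat{\Delta}_S\|_1 \;\le\; 2\lambda \sqrt{k}\, \|\hat{\Delta}\|_2,
\end{align*}

where $S = \mathrm{supp}(\beta^*)$; lower-bounding the left-hand side by $\alpha\|\hat{\Delta}\|_2^2$ (restricted strong convexity) then yields $\|\hat{\beta} - \beta^*\|_2 \le \frac{2\lambda\sqrt{k}}{\alpha} = c\lambda\sqrt{k}$.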

Theoretical insight

Key observation: For a general loss function, if $\lambda \ge 2 \big\| \frac{X^T \ell'(\epsilon)}{n} \big\|_\infty$, obtain

\[
\|\hat{\beta} - \beta^*\|_2 \le c \lambda \sqrt{k}.
\]

$\ell'(\epsilon)$ is sub-Gaussian whenever $\ell'$ is bounded, so we can achieve estimation error

\[
\|\hat{\beta} - \beta^*\|_2 \le c \sqrt{\frac{k \log p}{n}},
\]

without assuming $\epsilon_i$ is sub-Gaussian.

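A quick numerical illustration of this observation (a sketch with arbitrary dimensions, using the Huber $\psi$ as the bounded derivative): with Cauchy errors, $\|X^T\epsilon/n\|_\infty$ is typically enormous, while the clipped version $\|X^T\ell'(\epsilon)/n\|_\infty$ stays on the order of $\sqrt{\log p / n}$, which is what licenses the usual choice of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 1000
X = rng.standard_normal((n, p))
eps = rng.standard_cauchy(n)          # heavy-tailed errors: no finite moments

def huber_psi(u, c=1.345):
    # Bounded derivative of the Huber loss
    return np.clip(u, -c, c)

# Empirical size of the quantity that dictates how small lambda can be taken
print("||X^T eps / n||_inf      =", np.abs(X.T @ eps / n).max())
print("||X^T psi(eps) / n||_inf =", np.abs(X.T @ huber_psi(eps) / n).max())
print("sqrt(log p / n)          =", np.sqrt(np.log(p) / n))
```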

Local statistical consistency

Local RSC condition: For $\Delta := \beta_1 - \beta_2$,

\[
\langle \nabla \mathcal{L}_n(\beta_1) - \nabla \mathcal{L}_n(\beta_2), \, \Delta \rangle \ge \alpha \|\Delta\|_2^2 - \tau \frac{\log p}{n} \|\Delta\|_1^2, \qquad \forall \, \|\beta_j - \beta^*\|_2 \le r.
\]

How is such a result possible?

[Figure: a saddle-shaped loss surface with directions of both positive and negative curvature.]

The loss function has directions of both positive and negative curvature; the negative directions are forbidden by the regularizer.

Only requires restricted curvature within a constant-radius region around $\beta^*$.


Consistency of local stationary points

[Figure: within a ball of radius $r$ around $\beta^*$, all stationary points $\tilde{\beta}$ lie within $O\big(\sqrt{k \log p / n}\big)$ of $\beta^*$.]

Theorem (L. '15)

Suppose $\mathcal{L}_n$ satisfies $\alpha$-local RSC and $\rho_\lambda$ is $\mu$-amenable, with $\alpha > \mu$. Suppose $(\lambda, R)$ are chosen appropriately. For $n \gtrsim \frac{\tau}{\alpha - \mu} k \log p$, any stationary point $\tilde{\beta}$ such that $\|\tilde{\beta} - \beta^*\|_2 \le r$ satisfies

\[
\|\tilde{\beta} - \beta^*\|_2 \lesssim \frac{\lambda \sqrt{k}}{\alpha - \mu}.
\]


Second-order considerations

Question: If any bounded-derivative $\ell$ works for heavy-tailed distributions, why not always use Huber? LAD?

Answer: It has to do with (asymptotic) efficiency.

In low-dimensional settings, the MLE is maximally efficient with respect to variance.

Although the MLE may behave erratically when $p/n$ converges to a constant in $(0, 1]$, we can achieve simple asymptotic normality results under a sparsity assumption, via the oracle property.


Asymptotic efficiency

[Figure: $\ell_2$-error and empirical variance of the first component versus $n/(k \log p)$, for the Huber and Cauchy losses with $p \in \{128, 256, 512\}$.]

$\ell_2$-error and empirical variance of M-estimators when the errors follow a Cauchy distribution (SCAD regularizer).


How to obtain local efficient solutions?

Goal: Nonconvex regularized M-estimator such that $\ell$ satisfies $\alpha$-local RSC.

Target: Locate a stationary point within radius $r$ of $\beta^*$.

"Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. ... It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ." — Huber 1981, pp. 191–192


Two-step algorithm (L. '15)

Use composite gradient descent starting from a close initialization.

Two-step M-estimator: finds local stationary points of a nonconvex, robust loss + $(\mu, \gamma)$-amenable penalty.

Algorithm (a code sketch follows below):

1 Run composite gradient descent on a convex, robust loss + $\ell_1$-penalty until convergence; output $\hat{\beta}_H$.

2 Run composite gradient descent on the nonconvex, robust loss + $(\mu, \gamma)$-amenable penalty, with input $\beta^0 = \hat{\beta}_H$.

Theoretical guarantees on the (rate of) convergence to the optimal point.

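A minimal NumPy sketch of the two-step procedure, under several assumptions not fixed by the slides: the Huber loss (and its ψ) for the convex first stage, Tukey's bisquare for the nonconvex second stage, MCP as the $(\mu, \gamma)$-amenable penalty, a fixed step size, and the side constraint $\|\beta\|_1 \le R$ omitted for brevity.

```python
import numpy as np

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)

def tukey_psi(u, c=4.685):
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def mcp_prox(v, lam, gamma, step):
    # Proximal operator of step * MCP(lam, gamma), coordinatewise (requires gamma > step)
    st = soft_threshold(v, step * lam)
    return np.where(np.abs(v) <= gamma * lam, st / (1.0 - step / gamma), v)

def composite_gd(X, y, psi, prox, beta0, step, n_iter=500):
    # Composite gradient descent: gradient step on the smooth robust loss,
    # followed by the proximal step of the (possibly nonconvex) penalty
    beta = beta0.copy()
    n = X.shape[0]
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n
        beta = prox(beta - step * grad)
    return beta

def two_step_m_estimator(X, y, lam, gamma=3.0, step=0.1):
    p = X.shape[1]
    # Step 1: convex robust loss (Huber) + ell_1 penalty, from the zero initialization
    beta_H = composite_gd(X, y, huber_psi,
                          lambda v: soft_threshold(v, step * lam),
                          np.zeros(p), step)
    # Step 2: nonconvex robust loss (Tukey) + MCP, warm-started at beta_H
    return composite_gd(X, y, tukey_psi,
                        lambda v: mcp_prox(v, lam, gamma, step),
                        beta_H, step)

# Toy example: sparse regression with heavy-tailed (Cauchy) errors
rng = np.random.default_rng(3)
n, p, k = 200, 400, 5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:k] = 1.0
y = X @ beta_star + rng.standard_cauchy(n)
beta_hat = two_step_m_estimator(X, y, lam=0.2)
print("l2 error:", np.linalg.norm(beta_hat - beta_star))
```

The warm start plays the role of Huber's advice quoted above: the monotone-ψ stage lands inside the local RSC region, after which the redescending-ψ stage refines the estimate without being derailed by nonconvexity.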

Summary of two-step M-estimator

Output is computationally and statistically efficient:

Computational guarantee on the rate of convergence in each step of the M-estimation algorithm

Statistical guarantee on the asymptotic efficiency of the estimator (assuming a β-min condition and (µ, γ)-amenability)


Summary

Theory for nonconvex regularized M-estimators:

Global RSC condition $\Longrightarrow$ all stationary points within statistical error of $\beta^*$

Local RSC condition $\Longrightarrow$ stationary points in a local region within statistical error of $\beta^*$

Consequences for high-dimensional robust regression estimators:

Consistency under relaxed distributional assumptions when $\ell'$ is bounded

Oracle estimator with a $(\mu, \gamma)$-amenable regularizer $\Longrightarrow$ asymptotic efficiency

Two-step M-estimator produces local oracle solutions


References

P. Loh and M. J. Wainwright (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research.

P. Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Thank you!
