
Statistics and Machine Learning via a Modern Optimization Lens

Dimitris Bertsimas

Operations Research Center, Massachusetts Institute of Technology

June 2015


Outline

1 Motivation

2 Best Subset Selection. Joint work with Angie King and Rahul Mazumder.

3 Least Median of Squares Regression. Joint work with Rahul Mazumder; Annals of Statistics, 2014.

4 An Algorithmic Approach to Linear Regression. Joint work with Angie King.

5 Conclusions


Motivation


Continuous optimization methods have historically played a significant role in statistics.

In the last two decades, convex optimization methods have had increasing importance: Compressed Sensing and Matrix Completion, among many others.

Many problems in statistics and machine learning can naturally be expressed as mixed integer optimization (MIO) problems.

MIO in statistics is considered impractical and the corresponding problems intractable.

Heuristic methods are used instead: Lasso for best subset regression, or CART for optimal classification.


Progress of MIO


Speedup between CPLEX 1.2 (1991) and CPLEX 11 (2007): 29,000 times

Gurobi 1.0 (2009) comparable to CPLEX 11

Speedup between Gurobi 1.0 and Gurobi 5.5 (2013): 20 times

Total software speedup: 580,000 times

A MIO that would have taken 7 years to solve 20 years ago can now be solved on the same 20-year-old computer in less than one second.

Hardware speedup: $10^{5.5} \approx 320{,}000$ times

Total speedup: 200 billion times!


Progress of MIO

Research Objectives

Given the dramatically increased power of MIO, is MIO able to solve key multivariate statistics problems considered intractable a decade ago?

How do MIO solutions compete with state-of-the-art solutions?

Building regression models is an art. Can we algorithmize the process?

What are the implications for teaching statistics?


Progress of MIO

Problems

Best Subset Regression:

$$\min_{\beta}\ \tfrac{1}{2}\|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \le k$$

Least Median Regression:

$$\min_{\beta}\ \operatorname*{median}_{i=1,\ldots,n} |y_i - x_i^T\beta|$$

Algorithmic Regression to accommodate sparsity, limiting multicollinearity, categorical variables, group sparsity, nonlinear transformations, robustness, and statistical significance.


Best Subset Regression


$$\min_{\beta}\ \tfrac{1}{2}\|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \le k$$

Furnival and Wilson (1974) solve it by implicit enumeration, the leaps routine in R. It cannot scale beyond p = 30.

Lasso, proposed in Tibshirani (1996) [13,500 citations] and Chen, Donoho and Saunders (1998) [7,100 citations], scales to very large problems via convex quadratic optimization:

$$\min_{\beta}\ \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_i |\beta_i|$$

Under regularity conditions on X, Lasso leads to sparse models and good predictive performance.
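
As a concrete illustration of the convex quadratic formulation above, here is a minimal sketch in Python using cvxpy (assumptions: the cvxpy package is installed and `lam` stands for λ; the function name is mine, not the implementation used in the talk).

```python
import cvxpy as cp

def lasso(X, y, lam):
    """Minimal Lasso sketch: 0.5 * ||y - X beta||_2^2 + lam * ||beta||_1."""
    beta = cp.Variable(X.shape[1])
    objective = 0.5 * cp.sum_squares(y - X @ beta) + lam * cp.norm1(beta)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value
```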


Best Subset Regression

Lasso is a robustification method, not a sparsity-inducing method

Norm:

$$\|\Delta\|_{(q,p)} = \max_{\beta} \frac{\|\Delta\beta\|_p}{\|\beta\|_q}$$

Uncertainty set:

$$\mathcal{U}_{(q,p)} = \{\Delta \in \mathbb{R}^{m \times n} : \|\Delta\|_{(q,p)} \le \lambda\}$$

Robustification-Regularization Equivalence:

$$\min_{\beta}\ \max_{\Delta \in \mathcal{U}_{(q,p)}} \|y - (X + \Delta)\beta\|_p = \min_{\beta}\ \|y - X\beta\|_p + \lambda\|\beta\|_q,$$

and p = 2, q = 1 gives the Lasso.
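
A small numerical check of the p = 2, q = 1 case for a fixed β, assuming only numpy: the perturbation Δ* = −λ u sign(β)' (with u the normalized residual) lies in the uncertainty set and attains the regularized value, illustrating the equivalence. All variable names here are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

resid = y - X @ beta
regularized = np.linalg.norm(resid, 2) + lam * np.linalg.norm(beta, 1)

# Worst-case Delta in U_(1,2): every column has l2 norm at most lam,
# aligned with the residual direction.
u = resid / np.linalg.norm(resid, 2)
Delta = -lam * np.outer(u, np.sign(beta))
robust = np.linalg.norm(y - (X + Delta) @ beta, 2)

print(np.isclose(robust, regularized))   # the two values coincide
```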


Best Subset Regression

Our approach

Natural MIO formulation

$$\begin{aligned}
\min_{\beta, z}\quad & \tfrac{1}{2}\|y - X\beta\|_2^2 \\
\text{subject to}\quad & |\beta_i| \le M \cdot z_i, \quad i = 1, \ldots, p \\
& \sum_{i=1}^{p} z_i \le k \\
& z_i \in \{0, 1\}, \quad i = 1, \ldots, p
\end{aligned}$$

First-order methods to find good feasible solutions that are used as warm-starts

Enhance MIO with warm-starts and an improved formulation
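
A minimal sketch of the big-M formulation above in Python/cvxpy (assumptions: cvxpy is installed together with a MIQP-capable solver such as Gurobi or Mosek; M, k and the function name are mine):

```python
import cvxpy as cp

def best_subset_mio(X, y, k, M):
    n, p = X.shape
    beta = cp.Variable(p)
    z = cp.Variable(p, boolean=True)
    constraints = [cp.abs(beta) <= M * z,   # |beta_i| <= M * z_i
                   cp.sum(z) <= k]          # at most k nonzeros
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - X @ beta)), constraints)
    problem.solve()                         # requires a mixed-integer QP solver
    return beta.value, z.value
```

In the approach above, the solutions produced by the first-order method are handed to the MIO solver as warm-starts; exactly how that is done depends on the solver interface.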


Best Subset Regression

Parameters

$$\mu := \max_{i \ne j} |\langle x_i, x_j \rangle|$$

$$\mu(m) := \max_{|I| = m}\ \max_{j \notin I}\ \sum_{i \in I} |\langle x_j, x_i \rangle| \le m\mu$$

$$|\langle x_{(1)}, y \rangle| \ge |\langle x_{(2)}, y \rangle| \ge \ldots \ge |\langle x_{(p)}, y \rangle|$$

$$M = \min\left\{ \frac{1}{\eta_k}\sqrt{\sum_{i=1}^{k} |\langle x_{(i)}, y \rangle|^2},\ \ \frac{1}{\sqrt{\eta_k}}\,\|y\|_2 \right\}, \qquad \eta_k = 1 - \mu(k - 1).$$
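
These quantities are directly computable from the data. A sketch of how one might compute them with numpy, assuming the columns of X are standardized; the helper names are mine:

```python
import numpy as np

def coherence_bound(X, y, k):
    """Compute mu, eta_k = 1 - mu(k-1), and the resulting bound M."""
    G = np.abs(X.T @ X)                     # |<x_i, x_j>|
    np.fill_diagonal(G, 0.0)
    mu = G.max()                            # mu = max_{i != j} |<x_i, x_j>|

    def mu_m(m):
        # for each j: sum of the m largest |<x_j, x_i>| over i != j; then max over j
        return 0.0 if m <= 0 else np.sort(G, axis=1)[:, -m:].sum(axis=1).max()

    corr = np.sort(np.abs(X.T @ y))[::-1]   # |<x_(1), y>| >= |<x_(2), y>| >= ...
    eta_k = 1.0 - mu_m(k - 1)               # assumed positive (low coherence)
    M = min(np.sqrt((corr[:k] ** 2).sum()) / eta_k,
            np.linalg.norm(y, 2) / np.sqrt(eta_k))
    return mu, eta_k, M
```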


Best Subset Regression

Special Ordered Sets (SOS) formulation

$$\begin{aligned}
\min_{\beta, z}\quad & \|y - X\beta\|_2^2 \\
\text{subject to}\quad & (\beta_i,\ 1 - z_i): \text{SOS type-1}, \quad i = 1, \ldots, p \\
& \sum_{i=1}^{p} z_i \le k \\
& z_i \in \{0, 1\}, \quad i = 1, \ldots, p
\end{aligned}$$
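
A hedged sketch of this SOS-1 formulation using gurobipy (assumptions: gurobipy and a Gurobi license are available; the auxiliary variable `s` stands for 1 − z and the function name is mine):

```python
import gurobipy as gp
from gurobipy import GRB

def best_subset_sos(X, y, k):
    n, p = X.shape
    m = gp.Model("best-subset-sos")
    beta = m.addVars(p, lb=-GRB.INFINITY, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    s = m.addVars(p, lb=0.0, ub=1.0, name="s")            # s_i plays the role of 1 - z_i
    m.addConstrs((s[i] + z[i] == 1 for i in range(p)), name="link")
    for i in range(p):
        m.addSOS(GRB.SOS_TYPE1, [beta[i], s[i]], [1, 2])  # beta_i nonzero only if z_i = 1
    m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k, name="sparsity")
    resid = [y[j] - gp.quicksum(X[j, i] * beta[i] for i in range(p)) for j in range(n)]
    m.setObjective(gp.quicksum(r * r for r in resid), GRB.MINIMIZE)
    m.optimize()
    return [beta[i].X for i in range(p)]
```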


Best Subset Regression

Diabetes Dataset, n = 350, p = 64, k = 6

Typical behavior of Overall Algorithm

[Figure: objective value vs. time (secs) for k = 6, showing upper bounds, lower bounds, and the global minimum.]


Best Subset Regression

Diabetes Dataset, n = 350, p = 64, k = 6

Typical behavior of Overall Algorithm

[Figure: MIO gap vs. time (secs) for k = 6.]


Best Subset Regression

First Order Method

Consider

$$\min_{\beta}\ g(\beta) = \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \le k$$

g(β) is convex and

$$\|\nabla g(\beta) - \nabla g(\beta_0)\| \le \ell \cdot \|\beta - \beta_0\|.$$

This implies that for all L ≥ ℓ,

$$g(\beta) \le Q(\beta) = g(\beta_0) + \langle \nabla g(\beta_0),\ \beta - \beta_0 \rangle + \frac{L}{2}\|\beta - \beta_0\|_2^2.$$

For the purpose of finding feasible solutions, we propose

$$\min_{\beta}\ Q(\beta) \quad \text{subject to} \quad \|\beta\|_0 \le k$$


Best Subset Regression

Solution

Equivalent to

$$\min_{\beta}\ \frac{L}{2}\left\|\beta - \left(\beta_0 - \frac{1}{L}\nabla g(\beta_0)\right)\right\|_2^2 - \frac{1}{2L}\|\nabla g(\beta_0)\|_2^2 \quad \text{s.t.} \quad \|\beta\|_0 \le k$$

Reducing to

$$\min_{\beta}\ \|\beta - u\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \le k$$

Order $|u_{(1)}| \ge |u_{(2)}| \ge \ldots \ge |u_{(p)}|$. The optimal solution is $\beta^* = H_k(u)$, where $H_k(u)$ retains the k largest elements of u and sets the rest to zero:

$$\big(H_k(u)\big)_i = \begin{cases} u_i, & i \in \{(1), \ldots, (k)\} \\ 0, & \text{otherwise.} \end{cases}$$


Best Subset Regression

First Order Algorithm

1 Initialize with a solution $\beta_0$; set $m = 0$.

2 Set $m := m + 1$.

3 Compute $\tilde{\beta}_{m+1} = H_k\left(\beta_m - \tfrac{1}{L}\nabla g(\beta_m)\right)$.

4 Perform a line search: $\lambda_{m+1} = \operatorname*{argmin}_{\lambda}\ g(\lambda \tilde{\beta}_{m+1})$ and set $\beta_{m+1} = \lambda_{m+1}\tilde{\beta}_{m+1}$.

5 Repeat Steps 2-4 until $\|\beta_{m+1} - \beta_m\| \le \epsilon$.
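
A minimal numpy sketch of these five steps for g(β) = ‖y − Xβ‖²₂ (assumptions: L is taken as twice the largest eigenvalue of X'X, the line search uses the closed form for minimizing g(λβ̃) over λ, and all names are mine):

```python
import numpy as np

def hard_threshold(u, k):
    """H_k(u): keep the k largest entries of u in absolute value, zero out the rest."""
    out = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[::-1][:k]
    out[idx] = u[idx]
    return out

def first_order(X, y, k, eps=1e-6, max_iter=1000):
    n, p = X.shape
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()       # Lipschitz constant of grad g
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -2 * X.T @ (y - X @ beta)            # gradient of g at beta
        beta_tilde = hard_threshold(beta - grad / L, k)
        Xb = X @ beta_tilde
        lam = (y @ Xb) / (Xb @ Xb) if Xb @ Xb > 0 else 0.0   # argmin_lambda g(lambda * beta_tilde)
        beta_new = lam * beta_tilde
        if np.linalg.norm(beta_new - beta) <= eps:
            return beta_new
        beta = beta_new
    return beta
```

In the pipeline described earlier, the resulting β is then handed to the MIO as a warm-start.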


Best Subset Regression

Rate of Convergence

The sequence $g(\beta_m)$ converges to $g(\bar{\beta})$, where $\bar{\beta}$ satisfies the fixed-point condition

$$\bar{\beta} = H_k\left(\bar{\beta} - \tfrac{1}{L}\nabla g(\bar{\beta})\right).$$

After M iterations:

$$\min_{m=0,\ldots,M} \|\beta_{m+1} - \beta_m\|^2 \le \frac{2\,\big(g(\beta_0) - g(\bar{\beta})\big)}{M\,(L - \ell)}$$

After $M = O(1/\epsilon)$ iterations the algorithm converges.


Best Subset Regression

Properties

Let $S_m = \{i : \beta_{m,i} \ne 0\}$, with $\beta_m = (\beta_{m,1}, \ldots, \beta_{m,p})$, be the support of the solution $\beta_m$.

The support stabilizes, i.e., there exists an $m^*$ such that $S_{m+1} = S_m$ for all $m \ge m^*$.

Once the support stabilizes, the algorithm converges linearly.

The practical implication is that the algorithm scales to very large problems and converges very fast.


Best Subset Regression

Quality of Solutions for n > p

Diabetes data: n = 350, p = 64.

Relative Accuracy = $(f_{\mathrm{alg}} - f_*)/f_*$; maximum time 500 secs.

| k | First Order Accuracy | First Order Time (s) | MIO Cold Start Accuracy | MIO Cold Start Time (s) | MIO Warm Start Accuracy | MIO Warm Start Time (s) |
|---|---|---|---|---|---|---|
| 9 | 0.1306 | 1 | 0.0036 | 500 | 0 | 346 |
| 20 | 0.1541 | 1 | 0.0042 | 500 | 0 | 77 |
| 49 | 0.1915 | 1 | 0.0015 | 500 | 0 | 87 |
| 57 | 0.1933 | 1 | 0 | 500 | 0 | 1 |


Best Subset Regression

Quality of Solutions for n < p

Synthetic data: n = 50, p = 2000.

| SNR | k | First Order Accuracy | First Order Time (s) | MIO Cold Start Accuracy | MIO Cold Start Time (s) | MIO Warm Start Accuracy | MIO Warm Start Time (s) |
|---|---|---|---|---|---|---|---|
| 3 | 4 | 0.1091 | 42.9 | 0.2910 | 500 | 0 | 65.9 |
| 3 | 5 | 0.1647 | 37.2 | 1.0510 | 500 | 0 | 72.2 |
| 3 | 6 | 0.6152 | 41.1 | 0.2769 | 500 | 0 | 77.1 |
| 3 | 7 | 0.7843 | 40.7 | 0.8715 | 500 | 0 | 160.7 |
| 3 | 8 | 0.5515 | 38.8 | 2.1797 | 500 | 0 | 295.8 |
| 3 | 9 | 0.7131 | 45.0 | 0.4204 | 500 | 0 | 96.0 |
| 7 | 4 | 0.2708 | 47.8 | 0 | 31 | 0 | 107.8 |
| 7 | 5 | 0.5072 | 45.6 | 0.7737 | 500 | 0 | 65.6 |
| 7 | 6 | 1.3221 | 40.3 | 0.5121 | 500 | 0 | 82.3 |
| 7 | 7 | 0.9745 | 40.9 | 0.7578 | 500 | 0 | 210.9 |
| 7 | 8 | 0.8293 | 40.5 | 1.8972 | 500 | 0 | 262.5 |
| 7 | 9 | 1.1879 | 44.2 | 0.4515 | 500 | 0 | 254.2 |


Best Subset Regression

Sparsity Detection for n = 500, p = 100

[Figure: number of nonzeros recovered vs. signal-to-noise ratio (1.742, 3.484, 6.967) for MIO, Lasso, Step, and Sparsenet.]


Best Subset Regression

Prediction Error = $\|X\beta_{\mathrm{alg}} - X\beta_{\mathrm{true}}\|_2^2 / \|X\beta_{\mathrm{true}}\|_2^2$

[Figure: prediction performance vs. signal-to-noise ratio (1.742, 3.484, 6.967) for MIO, Lasso, Step, and Sparsenet.]


Best Subset Regression

Sparsity Detection for n = 50, p = 2000

[Figure: number of nonzeros vs. signal-to-noise ratio (3, 7, 10) for Lasso, First Order + MIO, First Order Only, and Sparsenet.]


Best Subset Regression

Prediction Error for n = 50, p = 2000

[Figure: prediction performance vs. signal-to-noise ratio (3, 7, 10) for Lasso, First Order + MIO, First Order Only, and Sparsenet.]


Best Subset Regression

What did we learn?

For the case n > p, MIO+warm-starts finds provably optimal solutions for n = 1000s, p = 100s in minutes.

For the case n < p, MIO+warm-starts finds solutions with better prediction accuracy than Lasso for n = 100s, p = 1000s in minutes, and proves optimality in hours.

MIO solutions have a significant edge in detecting sparsity, and outperform Lasso in prediction accuracy.

Modern optimization (MIO+warm-starts) is capable of solving large-scale instances.


Regression: Art vs Science

The Art of Building Regression Models: Current Practice

Transform variables.

Pairwise scatterplots, correlation matrix.

Delete redundant variables.

Fit the full model, delete variables with insignificant t-tests. Examine residuals.

See if additional variables can be dropped/new variables brought in.

Validate the final model.


Regression: Art vs Science

Aspirations: From Art to Science

Propose an algorithm (automated process) to build regression models.

Approach: Express all desirable properties as MIO constraints.


Regression: Art vs Science

The initial MIO model

$$\begin{aligned}
\min_{\beta, z}\quad & \tfrac{1}{2}\|y - X\beta\|_2^2 + \Gamma\|\beta\|_1 && \text{Robustness} \\
\text{s.t.}\quad & z_l \in \{0, 1\}, \quad l = 1, \ldots, p \\
& -M z_l \le \beta_l \le M z_l, \quad l = 1, \ldots, p \\
& \sum_{l=1}^{p} z_l \le k && \text{Sparsity} \\
& z_i = z_j \quad \forall\, i, j \in GS_m,\ \forall m && \text{Group Sparsity} \\
& z_i + z_j \le 1 \quad \forall\, (i, j) \in HC && \text{Pairwise Collinearity} \\
& \sum_{i \in T_m} z_i \le 1 \quad \forall m && \text{Nonlinear Transform}
\end{aligned}$$
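
A sketch of this model in cvxpy (assumptions: a MIQP-capable solver behind cvxpy; `group_sets`, `collinear_pairs` and `transform_sets` are my placeholders for the index sets GS_m, HC and T_m):

```python
import cvxpy as cp

def algorithmic_regression(X, y, k, M, Gamma,
                           group_sets=(), collinear_pairs=(), transform_sets=()):
    p = X.shape[1]
    beta = cp.Variable(p)
    z = cp.Variable(p, boolean=True)
    cons = [cp.abs(beta) <= M * z,                       # -M z_l <= beta_l <= M z_l
            cp.sum(z) <= k]                              # sparsity
    for g in group_sets:                                 # group sparsity: in or out together
        cons += [z[i] == z[g[0]] for i in g[1:]]
    for i, j in collinear_pairs:                         # pairwise collinearity
        cons += [z[i] + z[j] <= 1]
    for t in transform_sets:                             # at most one transform per variable
        cons += [sum(z[i] for i in t) <= 1]
    objective = 0.5 * cp.sum_squares(y - X @ beta) + Gamma * cp.norm1(beta)   # robustness
    cp.Problem(cp.Minimize(objective), cons).solve()
    return beta.value, z.value
```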


Regression: Art vs Science

Controlling Multicollinearity and Statistical Significance

Given a set S, we compute via the bootstrap the confidence levels of the variables in S, as well as the condition number of the model.

If the condition number is higher than desired or there are variables that are not statistically significant, we add the constraint

$$\sum_{j \in S} z_j \le |S| - 1$$

Iterate.
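
A hedged sketch of this iteration (the helpers `solve_mio` and `significant_vars` are hypothetical stand-ins for re-solving the MIO with the accumulated cuts and for the bootstrap significance check; only `numpy.linalg.cond` is a real library call):

```python
import numpy as np

def refine(X, y, solve_mio, significant_vars, max_cond=100.0):
    cuts = []
    while True:
        support = solve_mio(X, y, cuts)            # indices S selected by the current MIO
        cond = np.linalg.cond(X[:, support])       # condition number of the selected submatrix
        if cond <= max_cond and significant_vars(X[:, support], y):
            return support                         # all desired properties hold
        cuts.append(list(support))                 # encodes sum_{j in S} z_j <= |S| - 1; iterate
```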


Regression: Art vs Science

Contrast with existing practice

All desired properties are simultaneously enforced.

MIO does not have to choose which model properties to favor by performing the steps in a certain order.

MIO is capable of handling datasets with more variables than a modeler can address manually.


Regression: Art vs Science

Examples

| Dataset | n | p | MIO K* | MIO R² | MIO Max Cor | Lasso K* | Lasso R² | Lasso Max Cor |
|---|---|---|---|---|---|---|---|---|
| Compact | 4096 | 21 | 15 | 0.717 | 0.733 | 21 | 0.725 | 0.942 |
| Elevator | 8280 | 18 | 11 | 0.808 | 0.678 | 15 | 0.809 | 0.999 |
| LPGA 2009 | 73 | 11 | 7 | 0.814 | 0.784 | 10 | 0.807 | 0.943 |
| Airline Costs | 15 | 9 | 2 | 0.672 | 0.501 | 9 | 0.390 | 0.973 |
| Diabetes | 221 | 64 | 4 | 0.334 | 0.423 | 14 | 0.381 | 0.672 |
| HIV 2 | 528 | 98 | 11 | 0.945 | 0.662 | 39 | 0.944 | 0.760 |


Least Median Regression

Effect of Outliers in Regression

The Least Squares (LS) estimator

$$\hat{\beta}^{(LS)} \in \operatorname*{argmin}_{\beta}\ \sum_{i=1}^{n} r_i^2, \qquad r_i = y_i - x_i'\beta,$$

is adversely affected by a single outlier and has a limiting breakdown point of 0 (Donoho & Huber '83; Hampel '75) (n → ∞ and p fixed).

The Least Absolute Deviation (LAD) estimator also has a breakdown point of 0:

$$\hat{\beta}^{(LAD)} \in \operatorname*{argmin}_{\beta}\ \sum_{i=1}^{n} |r_i|.$$

M-estimators (Huber 1973) slightly improve the breakdown point:

$$\sum_{i=1}^{n} \rho(r_i), \qquad \rho(r)\ \text{a symmetric function.}$$


Least Median Regression


The Least Median of Squares (LMS) estimator (Rousseeuw (1984)):

$$\hat{\beta}^{(LMS)} \in \operatorname*{argmin}_{\beta}\ \left(\operatorname*{median}_{i=1,\ldots,n} |r_i|\right).$$

LMS has the highest possible breakdown point of almost 50%.

More generally, the Least Quantile of Squares (LQS) estimator:

$$\hat{\beta}^{(LQS)} \in \operatorname*{argmin}_{\beta}\ |r_{(q)}|,$$

where $r_{(q)}$ is the qth ordered absolute residual:

$$|r_{(1)}| \le |r_{(2)}| \le \ldots \le |r_{(n)}|.$$


Least Median Regression

LMS Computation: State of the art

The LMS problem is NP-hard (Bernholt 2005).

Exact Algorithms

Enumeration-based and branch-and-bound methods scale like $O(n^p)$. In practice, exact algorithms scale up to n = 50, p = 5.

Heuristic Algorithms

Based on heuristic subsampling / local searches. In practice they scale significantly better, but come with no guarantees.


Least Median Regression

Problem we address

Solve the following problem:

$$\min_{\beta}\ |r_{(q)}|,$$

where $r_i = y_i - x_i'\beta$ and q is a quantile.

Our approach extends to

$$\min_{\beta}\ |r_{(q)}| \quad \text{subject to} \quad A\beta \le b \quad (\text{and/or } \|\beta\|_2^2 \le \delta)$$


Least Median Regression

Overview of our approach

1 Write the LMS problem as a MIO.

Using first-order methods, we find feasible solutions which we use as warm-starts.

3 First order solutions serve as warm-starts that enhance running times.


Least Median Regression

MIO Formulation

Notation: $|r_{(1)}| \le |r_{(2)}| \le \ldots \le |r_{(n)}|$.

Step 1: Introduce binary variables $z_i$, $i = 1, \ldots, n$, such that

$$z_i = \begin{cases} 1, & \text{if } |r_i| \le |r_{(q)}|, \\ 0, & \text{otherwise.} \end{cases}$$

Step 2: Use auxiliary continuous variables $\mu_i, \bar{\mu}_i \ge 0$ such that

$$|r_i| - \mu_i \le |r_{(q)}| \le |r_i| + \bar{\mu}_i, \quad i = 1, \ldots, n,$$

with the conditions: if $|r_i| \ge |r_{(q)}|$, then $\bar{\mu}_i = 0$ and $\mu_i \ge 0$; and if $|r_i| \le |r_{(q)}|$, then $\mu_i = 0$ and $\bar{\mu}_i \ge 0$.


Least Median Regression

MIO Formulation

$$\begin{aligned}
\min\quad & \gamma \\
\text{subject to}\quad & |r_i| + \bar{\mu}_i \ge \gamma, \quad i = 1, \ldots, n \\
& \gamma \ge |r_i| - \mu_i, \quad i = 1, \ldots, n \\
& M_u z_i \ge \bar{\mu}_i, \quad i = 1, \ldots, n \\
& M_\ell (1 - z_i) \ge \mu_i, \quad i = 1, \ldots, n \\
& \sum_{i=1}^{n} z_i = q \\
& \mu_i \ge 0, \quad i = 1, \ldots, n \\
& \bar{\mu}_i \ge 0, \quad i = 1, \ldots, n \\
& z_i \in \{0, 1\}, \quad i = 1, \ldots, n,
\end{aligned}$$

where $\gamma, z_i, \mu_i, \bar{\mu}_i$, $i = 1, \ldots, n$ (together with $\beta$, through $r_i = y_i - x_i'\beta$) are decision variables and $M_u, M_\ell$ are big-M constants.
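
A sketch of this formulation, as reconstructed here, in gurobipy (assumptions: gurobipy is available, `Mu`/`Ml` are user-supplied big-M constants, `mu_bar` denotes μ̄, and the absolute residuals are modeled with auxiliary variables via `addGenConstrAbs`):

```python
import gurobipy as gp
from gurobipy import GRB

def lqs_mio(X, y, q, Mu, Ml):
    n, p = X.shape
    m = gp.Model("lqs")
    beta = m.addVars(p, lb=-GRB.INFINITY, name="beta")
    r = m.addVars(n, lb=-GRB.INFINITY, name="r")          # residuals
    absr = m.addVars(n, name="absr")                      # |r_i|
    gamma = m.addVar(name="gamma")
    mu = m.addVars(n, name="mu")
    mu_bar = m.addVars(n, name="mu_bar")
    z = m.addVars(n, vtype=GRB.BINARY, name="z")
    m.addConstrs((r[i] == y[i] - gp.quicksum(X[i, j] * beta[j] for j in range(p))
                  for i in range(n)), name="resid")
    for i in range(n):
        m.addGenConstrAbs(absr[i], r[i])                  # absr_i = |r_i|
    m.addConstrs((absr[i] + mu_bar[i] >= gamma for i in range(n)))
    m.addConstrs((gamma >= absr[i] - mu[i] for i in range(n)))
    m.addConstrs((mu_bar[i] <= Mu * z[i] for i in range(n)))
    m.addConstrs((mu[i] <= Ml * (1 - z[i]) for i in range(n)))
    m.addConstr(gp.quicksum(z[i] for i in range(n)) == q)
    m.setObjective(gamma, GRB.MINIMIZE)
    m.optimize()
    return [beta[j].X for j in range(p)], gamma.X
```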


Least Median Regression

SOS formulation

$$\begin{aligned}
\min\quad & \gamma \\
\text{subject to}\quad & |r_i| - \gamma = \mu_i - \bar{\mu}_i, \quad i = 1, \ldots, n \\
& \sum_{i=1}^{n} z_i = q \\
& \gamma \ge \bar{\mu}_i, \quad i = 1, \ldots, n \\
& \mu_i \ge 0, \quad i = 1, \ldots, n \\
& \bar{\mu}_i \ge 0, \quad i = 1, \ldots, n \\
& (\mu_i, \bar{\mu}_i): \text{SOS-1}, \quad i = 1, \ldots, n \\
& (z_i, \mu_i): \text{SOS-1}, \quad i = 1, \ldots, n \\
& z_i \in \{0, 1\}, \quad i = 1, \ldots, n.
\end{aligned}$$


Least Median Regression

Typical Evolution of MIO

[Figure: objective value vs. time (s) on the Alcohol data for (n, p, q) = (44, 5, 31) (left) and (44, 7, 31) (right), showing upper bounds, lower bounds, and the optimal solution.]


Least Median Regression

First Order Method

Writing $r_{(q)} = y_{(q)} - x_{(q)}'\beta$,

$$|y_{(q)} - x_{(q)}'\beta| = \underbrace{\sum_{i=1}^{q+1} |y_{(i)} - x_{(i)}'\beta|}_{H_{q+1}(\beta)} - \underbrace{\sum_{i=1}^{q} |y_{(i)} - x_{(i)}'\beta|}_{H_{q}(\beta)}.$$

The function $H_m(\beta)$ is convex in $\beta$ (a sum of piecewise-linear convex functions):

$$\begin{aligned}
H_m(\beta) := \max_{w}\quad & \sum_{i=1}^{n} w_i |y_i - x_i'\beta| \\
\text{subject to}\quad & \sum_{i=1}^{n} w_i = m \\
& 0 \le w_i \le 1, \quad i = 1, \ldots, n.
\end{aligned}$$


Least Median Regression

Expressing $H_{q+1}(\beta)$ as an LO

$H_{q+1}(\beta)$ can be expressed as

$$\begin{aligned}
H_{q+1}(\beta) := \max_{w}\quad & \sum_{i=1}^{n} w_i |y_i - x_i'\beta| \\
\text{subject to}\quad & \sum_{i=1}^{n} w_i = q + 1 \\
& 0 \le w_i \le 1, \quad i = 1, \ldots, n.
\end{aligned}$$

By taking the dual and invoking strong duality, we have:

$$\begin{aligned}
H_{q+1}(\beta) = \min_{\theta, \nu}\quad & \theta\,(q + 1) + \sum_{i=1}^{n} \nu_i \\
\text{subject to}\quad & \theta + \nu_i \ge |y_i - x_i'\beta|, \quad i = 1, \ldots, n \\
& \nu_i \ge 0, \quad i = 1, \ldots, n.
\end{aligned}$$
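
A small numerical check of this duality claim, assuming cvxpy and numpy: for a fixed β, the value of the minimization above equals the sum of the q + 1 largest absolute residuals, which is what the maximization defining $H_{q+1}$ computes.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 40, 3, 20
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)
a = np.abs(y - X @ beta)                  # |y_i - x_i' beta| for the fixed beta

theta = cp.Variable()
nu = cp.Variable(n, nonneg=True)
prob = cp.Problem(cp.Minimize(theta * (q + 1) + cp.sum(nu)),
                  [theta + nu >= a])
prob.solve()

print(np.isclose(prob.value, np.sort(a)[::-1][:q + 1].sum()))   # True
```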


Least Median Regression

Expressing $H_q(\beta)$

$H_q(\beta)$ can be expressed as

$$\begin{aligned}
H_m(\beta) := \max_{w}\quad & \sum_{i=1}^{n} w_i |y_i - x_i'\beta| \\
\text{subject to}\quad & \sum_{i=1}^{n} w_i = m \\
& 0 \le w_i \le 1, \quad i = 1, \ldots, n.
\end{aligned}$$

Subgradients:

$$\partial H_q(\beta) = \operatorname{conv}\left\{ \sum_{i=1}^{n} w_i^* \operatorname{sgn}(y_i - x_i'\beta)\, x_i \,:\, w^* \in \operatorname*{argmax}_{w \in W_q} L(\beta, w) \right\},$$

$$H_q(\beta) \approx H_q(\beta_k) + \langle \partial H_q(\beta_k),\ \beta - \beta_k \rangle.$$


Least Median Regression

Combining the pieces.....

Given a current solution $\beta_k$, we find $\beta_{k+1}$ by solving

$$\begin{aligned}
\min_{\nu, \theta, \beta}\quad & \theta\,(q + 1) + \sum_{i=1}^{n} \nu_i - \langle \partial H_q(\beta_k),\ \beta \rangle \\
\text{subject to}\quad & \theta + \nu_i \ge |y_i - x_i'\beta|, \quad i = 1, \ldots, n \\
& \nu_i \ge 0, \quad i = 1, \ldots, n.
\end{aligned}$$

Termination:

$$\left( |y_{(q)} - x_{(q)}'\beta_k| - |y_{(q)} - x_{(q)}'\beta_{k+1}| \right) \le \mathrm{Tol} \cdot |y_{(q)} - x_{(q)}'\beta_k|$$

The algorithm converges in $O(1/\epsilon)$ iterations to a stationary point (a local optimal solution).
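
A hedged numpy/cvxpy sketch of this scheme (assumptions: $H_m(\beta)$ is taken as the sum of the m largest absolute residuals, so the quantity driven down is $H_{q+1} - H_q$; the subgradient uses the $-\operatorname{sgn}(r_i)\,x_i$ convention for the derivative of $|y_i - x_i'\beta|$; all names are mine):

```python
import cvxpy as cp
import numpy as np

def lqs_objective(X, y, beta, q):
    """H_{q+1} - H_q at beta: the (q+1)-th largest absolute residual."""
    return np.sort(np.abs(y - X @ beta))[::-1][q]

def subgrad_Hq(X, y, beta, q):
    """A subgradient of H_q (sum of the q largest |r_i|) at beta."""
    r = y - X @ beta
    idx = np.argsort(np.abs(r))[::-1][:q]
    return -(np.sign(r[idx]) @ X[idx, :])

def lqs_first_order(X, y, q, beta0, tol=1e-5, max_iter=200):
    n, p = X.shape
    beta_k = beta0.copy()
    for _ in range(max_iter):
        g = subgrad_Hq(X, y, beta_k, q)
        beta = cp.Variable(p)
        theta = cp.Variable()
        nu = cp.Variable(n, nonneg=True)
        problem = cp.Problem(
            cp.Minimize(theta * (q + 1) + cp.sum(nu) - g @ beta),
            [theta + nu >= cp.abs(y - X @ beta)])
        problem.solve()
        beta_next = beta.value
        old, new = lqs_objective(X, y, beta_k, q), lqs_objective(X, y, beta_next, q)
        if old - new <= tol * max(old, 1e-12):     # relative-improvement termination
            return beta_next
        beta_k = beta_next
    return beta_k
```

Each step minimizes a convex surrogate that is tight at $\beta_k$, so the objective is non-increasing across iterations.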


Least Median Regression

Impact of Warm-Starts

Evolution of MIO (cold-start) [top] vs (warm-start) [bottom]

[Figure: objective value vs. time (s) for cold-started (top) and warm-started (bottom) MIO, showing upper bounds, lower bounds, and the optimal solution.]

n = 501, p = 5, synthetic example


Least Median Regression

Conclusions

MIO+warm-starts solves medium-size problems (n = 500) to provable optimality in under two hours.

MIO+warm-starts finds high-quality solutions for large-scale problems (n = 10,000) in under two hours, outperforming all publicly available state-of-the-art algorithms for the LQS problem.

For problems of this size (n = 10,000), MIO does not provide a certificate of optimality in a reasonable amount of time.


Least Median Regression

Remarks on Complexity:

A key requirement of a theory is that it be positively correlated with empirical evidence.

Consider the Simplex method and solving the TSP.

A 200 billion times speedup forces us to reconsider what is tractable.

Definition: A problem is tractable if it can be solved for sizes and in times that are appropriate for the application.

Online trading problems need to be solved in milliseconds.

Regression problems used for planning need to be solved in minutes or in hours.

Asymptotic polynomial solvability or NP-hardness is not relevant under this definition.


Least Median Regression

Remarks on Statistics

Problems in statistics and machine learning considered intractable a generation ago are now tractable.

Lasso is a robustness property, not a sparsity-inducing property as widely believed.

In comparison to Lasso, MIO provides a significant edge in detecting sparsity, as well as an edge in predictive accuracy.

It is my view that the time to include MIO in statistics curricula has come.

Correspondingly, I will be teaching a first-year graduate-level class at MIT: Statistics under a modern optimization lens.
