Large-Scale Sparse Logistic Regression


Page 1: Large-Scale Sparse Logistic Regression

Center for Evolutionary Functional Genomics

Large-Scale Sparse Logistic Regression

Jieping Ye

Arizona State University

Joint work with Jun Liu and Jianhui Chen

Page 2: Large-Scale Sparse Logistic Regression


Sparse Logistic Regression

Prediction: disease or not
Confidence (probability)
Identify informative features

Page 3: Large-Scale Sparse Logistic Regression


Logistic Regression

Logistic Regression (LR) has been applied to:

Document classification (Brzezinski, 1999)

Natural language processing (Jurafsky and Martin, 2000)

Computer vision (Friedman et al., 2000)

Bioinformatics (Liao and Chin, 2007)

Regularization is commonly applied to reduce overfitting and obtain a robust classifier. Two well-known regularizations are:

L2-norm regularization (Minka, 2007)

L1-norm regularization (Koh et al., 2007)

Page 4: Large-Scale Sparse Logistic Regression


Sparse Logistic Regression

L1-norm regularization leads to sparse logistic regression (SLR):
Simultaneous feature selection and classification
Enhanced model interpretability
Improved classification performance

Applications:
M.-Y. Park and T. Hastie. Penalized Logistic Regression for Detecting Gene Interactions. Biostatistics, 2008.
T. Wu et al. Genomewide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics, 2009.

Page 5: Large-Scale Sparse Logistic Regression


Large-Scale Sparse Logistic Regression

Many applications involve data of large dimensionality

The MRI images used in Alzheimer’s Disease study contain more than 1 million voxels (features)

Major challenge: how to scale sparse logistic regression to large-scale problems?

Page 6: Large-Scale Sparse Logistic Regression


The Proposed Lassplore Algorithm

Lassplore (LArge-Scale SParse LOgistic REgression) is a first-order method

Each iteration of Lassplore involves only matrix-vector multiplications:
Scales to large-size problems
Efficient for sparse data

Lassplore achieves the optimal convergence rate among all first-order methods
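To make the cost claim concrete, here is a Python/NumPy sketch (not the authors' code) of the gradient of the average logistic loss: the only heavy operations are the two matrix-vector products A @ w and A.T @ r, which is what lets a first-order method scale to large and sparse data.

```python
import numpy as np

def logistic_grad(w, c, A, b):
    """Gradient of (1/m) * sum_i log(1 + exp(-b_i (w^T a_i + c))).

    A is the (m, n) data matrix, b the (m,) label vector in {-1, +1}.
    The cost is dominated by two matrix-vector products, so each step
    is cheap even when n is large, and cheaper still when A is sparse.
    """
    z = b * (A @ w + c)                    # margins b_i (w^T a_i + c)
    r = -b / (1.0 + np.exp(z))             # per-sample loss derivatives
    return (A.T @ r) / len(b), np.mean(r)  # gradients w.r.t. w and c
```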

Page 7: Large-Scale Sparse Logistic Regression


Outline

Logistic Regression

Sparse Logistic Regression

Lassplore

Experiments

Page 8: Large-Scale Sparse Logistic Regression


Logistic Regression (1)

The logistic regression model is given by

Prob(b | a) = 1 / (1 + exp(-b (w^T a + c)))

where a in R^n is the sample, b in {-1, +1} is the class label, and w in R^n and c in R are the weight vector and the intercept.
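In code, the model probability reads as follows (a minimal NumPy sketch, not part of the original Lassplore package):

```python
import numpy as np

def logistic_prob(w, c, a, b):
    """Prob(b | a) = 1 / (1 + exp(-b (w^T a + c))) for b in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-b * (np.dot(w, a) + c)))
```

Note that the probabilities of the two labels sum to one for any sample, since flipping b negates the argument of the sigmoid.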

Page 9: Large-Scale Sparse Logistic Regression


Logistic Regression (2)

Given a set of m training pairs {(a_i, b_i)}_{i=1}^m, we can compute w and c by minimizing the average logistic loss:

min_{w,c} (1/m) sum_{i=1}^m log(1 + exp(-b_i (w^T a_i + c)))

Minimizing this loss makes each Prob(b_i | a_i) large; without regularization, the fitted model is prone to overfitting.
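The loss can be evaluated stably with np.logaddexp, since log(1 + exp(z)) overflows for large z when computed naively; this is a hedged NumPy sketch, not the authors' implementation:

```python
import numpy as np

def avg_logistic_loss(w, c, A, b):
    """(1/m) * sum_i log(1 + exp(-b_i (w^T a_i + c))).

    logaddexp(0, z) computes log(1 + exp(z)) without overflow.
    """
    z = -b * (A @ w + c)
    return np.mean(np.logaddexp(0.0, z))
```

At w = 0 and c = 0 every term equals log 2, the loss of a maximally uncertain classifier.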

Page 10: Large-Scale Sparse Logistic Regression


L1-ball Constrained Logistic Regression

Favorable properties:
Obtaining sparse solutions
Performing feature selection and classification simultaneously
Improving classification performance

How to solve the L1-ball constrained optimization problem?

Page 11: Large-Scale Sparse Logistic Regression


Gradient Method for Sparse Logistic Regression

Let us consider gradient descent for solving the optimization problem

min_{x in G} g(x)

At each iteration, the next point x_{k+1} is obtained from x_k by the gradient step

x_{k+1} = x_k - g'(x_k) / L_k
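A minimal sketch of this iteration (ignoring the feasible set G for the moment; in the constrained case each step is followed by a projection onto G):

```python
import numpy as np

def gradient_descent(grad, x0, L, n_iter=100):
    """Plain gradient descent x_{k+1} = x_k - g'(x_k) / L with a fixed
    step size 1/L, where L is a Lipschitz constant of the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - grad(x) / L
    return x

# For g(x) = ||x||^2 / 2 the gradient is x and L = 1; the minimizer is 0.
x_star = gradient_descent(lambda x: x, np.array([4.0, -3.0]), L=1.0)
```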

Page 12: Large-Scale Sparse Logistic Regression


Euclidean Projection onto the L1-Ball

[Figure: points v_1, v_2, v_3 and their Euclidean projections π(v_1), π(v_2), π(v_3) onto the L1-ball]

The Euclidean projection onto the L1-ball (Duchi et al., 2008) is a building block, and it can be solved in linear time (Liu and Ye, 2009).
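For illustration, here is the simpler O(n log n) sort-based projection of Duchi et al. (2008); the linear-time method of Liu and Ye (2009) cited above computes the same point more efficiently, and this sketch is not the Lassplore implementation:

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of v onto {x : ||x||_1 <= z} via the
    sort-based method: soft-threshold |v| by theta, keep the signs."""
    if np.abs(v).sum() <= z:
        return v.copy()                     # already inside the ball
    u = np.sort(np.abs(v))[::-1]            # sorted magnitudes, descending
    css = np.cumsum(u)
    # largest index rho with u_rho > (css_rho - z) / (rho + 1)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - z)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

For example, projecting (3, 1) onto the ball of radius 2 gives (2, 0), which also shows how the projection zeroes out small components and produces sparsity.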

Page 13: Large-Scale Sparse Logistic Regression


Gradient Method & Nesterov’s Method (1)

Convergence rates:

g(.)                                                Gradient Descent       Nesterov's method
smooth and convex                                   O(1/k)                 O(1/k^2)
smooth and strongly convex with condition number C  O(((C-1)/(C+1))^(2k))  O((1 - 1/sqrt(C))^k)

Nesterov's method achieves the lower-complexity bound of smooth optimization by first-order black-box methods, and is thus an optimal method.

Page 14: Large-Scale Sparse Logistic Regression


Gradient Method & Nesterov’s Method (2)

The theoretical number of iterations (up to a constant factor) for achieving an accuracy of 10^-8:

g(.)                                                      Gradient Descent   Nesterov's method
smooth and convex                                         10^8               10^4
smooth and strongly convex with condition number C=10^4   4.6×10^4           1.8×10^3

Page 15: Large-Scale Sparse Logistic Regression


Characteristics of Lassplore

First-order black-box oracle-based method:
At each iteration, we only need to evaluate the function value and the gradient

Built on Nesterov's method (Nesterov, 2003):
Global convergence rate of O(1/k^2) for the general case
Linear convergence rate for the strongly convex case

An adaptive line search scheme:
The step size is allowed to increase during the iterations
This scheme is applicable to general smooth convex optimization

Page 16: Large-Scale Sparse Logistic Regression


Key Components and Settings

[Figure: the search point s_k lies on the line through x_{k-1} and x_k; the gradient step from s_k gives x_{k+1}]

s_k = x_k + beta_k (x_k - x_{k-1})
x_{k+1} = s_k - g'(s_k) / L_k

Previous schemes for choosing (beta_k, L_k):
Nesterov's constant scheme (Nesterov, 2003)
Nemirovski's line search scheme (Nemirovski, 1994)
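A sketch of an accelerated iteration built from these two components, using the standard t_k recursion for beta_k; this is the textbook constant-step variant, not the Lassplore line-search version:

```python
import numpy as np

def nesterov(grad, x0, L, n_iter=200):
    """Nesterov's accelerated gradient method with a fixed step 1/L:
    s_k = x_k + beta_k (x_k - x_{k-1}),  x_{k+1} = s_k - g'(s_k) / L,
    with beta_k = (t_k - 1) / t_{k+1} from the standard recursion."""
    x_prev = x = np.asarray(x0, dtype=float)
    t = 1.0
    for _ in range(n_iter):
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        s = x + ((t - 1.0) / t_next) * (x - x_prev)   # extrapolation step
        x_prev, x = x, s - grad(s) / L                # gradient step at s
        t = t_next
    return x
```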

Page 17: Large-Scale Sparse Logistic Regression


Previous Line Search Schemes

Nesterov's constant scheme (Nesterov, 2003):
L_k is set to a constant value L, the Lipschitz constant of the gradient of g(.)
beta_k depends on the condition number C

Nemirovski's line search scheme (Nemirovski, 1994):
L_k is allowed to increase, and is upper-bounded by 2L
beta_k is identical for every function g(.)

Page 18: Large-Scale Sparse Logistic Regression


Proposed Line Search Scheme

Characteristics:
L_k is adaptively tuned (allowed to both increase and decrease) and upper-bounded by 2L
beta_k is dependent on L_k
The scheme preserves the optimal convergence rate (see the paper for the technical proof)
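A hypothetical sketch of the adaptive idea: L_k may shrink (giving a larger trial step) as well as grow between iterations. The acceptance test below is the standard sufficient-decrease condition, a deliberate simplification of the conditions used in the paper:

```python
import numpy as np

def adaptive_step(g, grad_s, s, L_k, L_min=1e-3):
    """One adaptive step-size search in the spirit of the scheme above.

    Starts from a value below the previous L_k (an optimistic, larger
    step) and doubles it until the sufficient-decrease test
    g(x) <= g(s) + <g'(s), x - s> + (L/2) ||x - s||^2 is satisfied.
    """
    gs, d = g(s), grad_s
    L = max(L_k / 2.0, L_min)      # try a smaller L (larger step) first
    while True:
        x = s - d / L
        diff = x - s
        if g(x) <= gs + d @ diff + 0.5 * L * np.dot(diff, diff):
            return x, L            # accepted: return new point and L_k
        L *= 2.0                   # shrink the step and retry
```

On a quadratic with Lipschitz constant 1, starting from L_k = 1, the search first tries L = 0.5, rejects it, and settles back at L = 1.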

Page 19: Large-Scale Sparse Logistic Regression


Related Work

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 2007/76.
S. Becker, J. Bobin, and E. J. Candès. NESTA: a fast and accurate first-order method for sparse recovery. 2009.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183-202, 2009.
K.-C. Toh and S. Yun. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Preprint, National University of Singapore, March 2009.
S. Ji and J. Ye. An Accelerated Gradient Method for Trace Norm Minimization. The Twenty-Sixth International Conference on Machine Learning, 2009.

Page 20: Large-Scale Sparse Logistic Regression


Experiments: Data Sets

Page 21: Large-Scale Sparse Logistic Regression


Comparison of the Line Search Schemes

Comparison of the proposed adaptive scheme (Adap) with the scheme proposed by Nemirovski (Nemi)

[Figure: L_k and the objective value versus iteration for the two schemes]

Page 22: Large-Scale Sparse Logistic Regression


Pathwise Solutions: Warm Start vs. Cold Start

Page 23: Large-Scale Sparse Logistic Regression


Comparison with ProjectionL1 (Schmidt et al., 2007)

Adaptive Scheme

Page 24: Large-Scale Sparse Logistic Regression


Comparison with ProjectionL1 (Schmidt et al., 2007)

Adaptive Scheme

Page 25: Large-Scale Sparse Logistic Regression


Comparison with l1-logreg (Koh et al., 2007)

Page 26: Large-Scale Sparse Logistic Regression


Drosophila Gene Expression Image Analysis

Drosophila embryogenesis is divided into 17 developmental stages (1-17)

BDGP

Fly-FISH

Page 27: Large-Scale Sparse Logistic Regression


Sparse Logistic Regression: Application (2)

Page 28: Large-Scale Sparse Logistic Regression


Summary

The Lassplore algorithm for sparse logistic regression:
First-order black-box method
Optimal convergence rate
Adaptive line search scheme

Future work:
Apply the proposed approach to other mixed-norm regularized optimization problems
Biological image analysis

Page 29: Large-Scale Sparse Logistic Regression


The Lassplore Package

http://www.public.asu.edu/~jye02/Software/lassplore/

Page 30: Large-Scale Sparse Logistic Regression


Thank you!