# Large Scale Optimization for Machine Learning: Lecture 21

Post on 20-Aug-2020

Large Scale Optimization for Machine Learning

Meisam Razaviyayn Lecture 21

razaviya@usc.edu

Announcements:

• Midterm exams

• Return it next Tuesday

Non-smooth Objective Function

• Sub-gradient
  • Typically slow and no good termination criteria (other than cross validation)

• Proximal Gradient
  • Fast assuming each iteration is easy

• Block Coordinate Descent
  • Also helpful for exploiting multi-block structure

• Alternating Direction Method of Multipliers (ADMM)
  • Will be covered later

Multi-Block Structure and BCD Method

Block Coordinate Descent (BCD) Method:

Simple and scalable: Lasso example

At iteration r, choose an index i and

Choice of index i: Cyclic, randomized, Greedy

Very different from the previous incremental GD, SGD, …
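The Lasso example can be sketched in code. This is a minimal illustration rather than the lecture's own implementation: it assumes the formulation 0.5*||Ax - b||^2 + lam*||x||_1, scalar blocks, and the cyclic index rule. The names `lasso_bcd`, `soft_threshold`, and `lasso_objective` are hypothetical. Each inner step minimizes the objective exactly over one coordinate with all the others held fixed, which for Lasso has a closed form via soft-thresholding.

```python
import numpy as np

def soft_threshold(z, t):
    """Minimizer of 0.5*(x - z)**2 + t*|x| over scalar x."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_objective(A, b, x, lam):
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def lasso_bcd(A, b, lam, n_sweeps=50):
    """Cyclic BCD with scalar blocks (assumed setup, see lead-in)."""
    n = A.shape[1]
    x = np.zeros(n)
    residual = b - A @ x              # kept up to date after every update
    col_sq = np.sum(A ** 2, axis=0)   # squared column norms ||a_i||^2
    history = [lasso_objective(A, b, x, lam)]
    for _ in range(n_sweeps):
        for i in range(n):            # cyclic choice of the index i
            if col_sq[i] == 0.0:
                continue              # a zero column cannot change the fit
            # a_i^T (b - sum_{j != i} a_j x_j): the fit with x_i left out
            rho = A[:, i] @ residual + col_sq[i] * x[i]
            new_xi = soft_threshold(rho, lam) / col_sq[i]
            residual += A[:, i] * (x[i] - new_xi)
            x[i] = new_xi
        history.append(lasso_objective(A, b, x, lam))
    return x, history
```

Because every coordinate update is an exact minimization, the recorded objective values in `history` never increase. A randomized or greedy rule would only change how `i` is picked inside the sweep.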

Geometric Interpretation

Convergence of BCD Method

Assumptions:

• Separable constraints

• Differentiable/smooth objective

• Unique minimizer at each step

Necessary assumptions?

“Nonlinear Programming”, D.P. Bertsekas, for the cyclic update rule

Proof?

Necessity of Smoothness Assumption

Not “Regular”
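A small numerical check can make the failure concrete. The function below is an illustrative choice and is not taken from the slides: f(x, y) = |x + y| + 3|x - y| is convex, but its non-smooth part does not separate across the two coordinates. The helper names and the grid-search minimizer are assumptions made for the sketch. Starting from (1, 1), neither coordinate update can decrease f, even though the global minimum is f(0, 0) = 0.

```python
import numpy as np

def f(x, y):
    # Convex, non-smooth, and not separable in (x, y).
    return np.abs(x + y) + 3.0 * np.abs(x - y)

def coordinate_minimizer(g, lo=-5.0, hi=5.0, num=100001):
    """Approximate argmin of a 1-D function g on a fine grid."""
    grid = np.linspace(lo, hi, num)
    return grid[np.argmin(g(grid))]

def bcd(x, y, n_sweeps=10):
    """Cyclic BCD: minimize over x with y fixed, then over y with x fixed."""
    for _ in range(n_sweeps):
        x = coordinate_minimizer(lambda t: f(t, y))
        y = coordinate_minimizer(lambda t: f(x, t))
    return x, y

x, y = bcd(1.0, 1.0)
print(x, y, f(x, y), f(0.0, 0.0))
```

At (1, 1) the directional derivative is non-negative along both coordinate directions but negative along (-1, -1), which is the kind of behavior the regularity condition rules out.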

Examples: Lasso

smooth

BCD and Non-smooth Objective

Theorem [Tseng 2001]. Assume:
1) The feasible set is compact.
2) The minimizer is unique at each step.
3) The constraints are separable.
4) The objective function is regular.

Then every limit point of the iterates is a stationary point.

Rate of convergence of BCD:

• Similar to GD: sublinear for general convex objectives and linear for strongly convex objectives

• The same results can be shown for most popular non-smooth objectives

Definition of stationarity for non-smooth objectives?

True for the cyclic/randomized/greedy rules

Uniqueness of the Minimizer

[Michael J. D. Powell 1973]

Tensor PARAFAC Decomposition

NP-hard [Hastad 1990]

[Carroll 1970], [Harshman 1970]: Alternating Least Squares

[Figure: error versus iterates for alternating least squares, illustrating the “swamp” effect]
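Alternating least squares is BCD applied to the three factor matrices of the decomposition. The sketch below assumes a 3-way tensor and a fixed target rank. The function names `cp_als` and `khatri_rao` and the random initialization are hypothetical choices, not details from the lecture. Each update solves a linear least-squares problem in one factor with the other two fixed.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product; row j*K + k equals B[j, :] * C[k, :]."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def cp_als(X, rank, n_iter=200, seed=0):
    """ALS for X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r]."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Unfoldings whose column order matches khatri_rao above.
    X1 = X.reshape(I, J * K)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)
    errors = []
    for _ in range(n_iter):
        # Each line is an exact least-squares solve in one block.
        A = np.linalg.lstsq(khatri_rao(B, C), X1.T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), X2.T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), X3.T, rcond=None)[0].T
        approx = np.einsum('ir,jr,kr->ijk', A, B, C)
        errors.append(np.linalg.norm(X - approx) / np.linalg.norm(X))
    return A, B, C, errors
```

The returned `errors` list is non-increasing, but it can stay nearly flat for many iterations before dropping, which is the swamp behavior the figure refers to. How long such a plateau lasts depends on the tensor and the initialization.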

BCD Limitations

• Uniqueness of minimizer

• Each sub-problem needs to be easily solvable

Popular Solution: Inexact BCD

At iteration r, choose an index i and

Local approximation of the objective function

Block successive upper-bound minimization, block successive convex approximation, convex-concave procedure, majorization-minimization, DC programming, BCGD, …

Idea of Block Successive Upper-bound Minimization

Locally tight:

Global upper-bound:

Monotone Algorithm

Every limit point is a stationary point
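The two conditions and the monotonicity they imply can be written out as below. The notation is assumed here rather than taken from the slide: $u_i(\cdot\,; y)$ denotes the surrogate for block $i$ built at the point $y$, and $\mathcal{X}_i$ the feasible set of block $i$.

```latex
\begin{align*}
% Global upper bound: the surrogate majorizes f along block i.
u_i(x_i; y) &\ge f(y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_n)
  \quad \forall x_i \in \mathcal{X}_i, \\
% Locally tight: the surrogate touches f at the point where it is built.
u_i(y_i; y) &= f(y), \\
% Update for the chosen block i; the other blocks are left unchanged.
x_i^{r+1} &\in \arg\min_{x_i \in \mathcal{X}_i} u_i(x_i; x^r),
  \qquad x_j^{r+1} = x_j^r \;\; (j \ne i), \\
% Monotonicity: upper bound, then optimality of the update, then tightness.
f(x^{r+1}) &\le u_i(x_i^{r+1}; x^r) \le u_i(x_i^r; x^r) = f(x^r).
\end{align*}
```

The chain in the last line only shows that the objective never increases. The limit-point statement relies on further assumptions on the surrogate that are not reproduced here.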

Example 1: Block Coordinate (Proximal) Gradient Descent

Smooth Scenario:

Non-smooth Scenario:

Alternating Proximal Minimization:

Using Bregman divergence
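The non-smooth scenario can be sketched on Lasso with multi-coordinate blocks. This is an illustration under assumptions, not the lecture's code: the objective is 0.5*||Ax - b||^2 + lam*||x||_1, the smooth part is replaced by its quadratic upper bound with a per-block constant L_i, and the name `block_prox_gradient_lasso` and the `blocks` argument are hypothetical. Minimizing that upper bound plus the l1 term gives a soft-thresholded gradient step on the block.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def block_prox_gradient_lasso(A, b, lam, blocks, n_sweeps=100):
    """Cyclic block proximal gradient; `blocks` is a list of index arrays."""
    x = np.zeros(A.shape[1])
    # L_i: largest eigenvalue of A_i^T A_i, a Lipschitz constant of the
    # block gradient (assumed nonzero for every block).
    L = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]
    history = []
    for _ in range(n_sweeps):
        for idx, L_i in zip(blocks, L):
            grad_i = A[:, idx].T @ (A @ x - b)   # block gradient of the smooth part
            # Minimizer of the quadratic upper bound plus lam * ||x_i||_1.
            x[idx] = soft_threshold(x[idx] - grad_i / L_i, lam / L_i)
        history.append(0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x)))
    return x, history
```

Unlike exact BCD, each step here only needs a block gradient and a thresholding operation, so the sub-problem stays cheap even when the block is large.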

Example 2: Expectation Maximization Algorithm

Jensen’s inequality
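The role of Jensen's inequality can be spelled out as follows. The notation is an assumption for this sketch, since the slide's formulas are not in the transcript: $x$ is the observed data, $z$ the latent variable (taken discrete here), $\theta$ the parameter, and $\theta^r$ the current iterate.

```latex
\begin{align*}
-\log p(x; \theta)
  &= -\log \sum_{z} p(z \mid x; \theta^r)\,
       \frac{p(x, z; \theta)}{p(z \mid x; \theta^r)} \\
  % Jensen's inequality for the convex function -log
  &\le -\sum_{z} p(z \mid x; \theta^r)\,
       \log \frac{p(x, z; \theta)}{p(z \mid x; \theta^r)}
   \;=:\; u(\theta; \theta^r).
\end{align*}
```

At $\theta = \theta^r$ the ratio inside the logarithm equals $p(x; \theta^r)$ for every $z$, so the inequality holds with equality there. The bound is therefore a global upper bound that is locally tight, and minimizing $u(\theta; \theta^r)$ over $\theta$ is the M-step.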

Example 3: Transcript Abundance Estimation

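levels $\rho_1, \ldots, \rho_M$ can be written as

$$
\begin{aligned}
\Pr(R_1, \ldots, R_N; \rho_1, \ldots, \rho_M)
  &= \prod_{n=1}^{N} \Pr(R_n; \rho_1, \ldots, \rho_M) \\
  &= \prod_{n=1}^{N} \left( \sum_{m=1}^{M} \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)\, \Pr(s_m) \right) \\
  &= \prod_{n=1}^{N} \left( \sum_{m=1}^{M} \alpha_{nm} \rho_m \right),
\end{aligned}
$$

where $\alpha_{nm} \triangleq \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)$ can be obtained efficiently using an alignment algorithm such as the ones based on the Burrows-Wheeler transform; see, e.g., [85], [86]. Therefore, given $\{\alpha_{nm}\}_{n,m}$, the maximum likelihood estimation of the abundance levels can be stated as

$$
\hat{\rho}_{\mathrm{ML}} = \arg\min_{\rho} \; -\sum_{n=1}^{N} \log\left( \sum_{m=1}^{M} \alpha_{nm} \rho_m \right)
\quad \text{s.t.} \quad \sum_{m=1}^{M} \rho_m = 1, \;\; \rho_m \ge 0, \;\; \forall m = 1, \ldots, M. \tag{36}
$$

As a special case of the EM algorithm, a popular approach for solving this optimization problem is to successively minimize a local tight upper-bound of the objective function. In particular, the eXpress software [87] solves the following optimization problem at the r-th iteration of the algorithm:

$$
\rho^{r+1} = \arg\min_{\rho} \; -\sum_{n=1}^{N} \left( \sum_{m=1}^{M} \left( \frac{\alpha_{nm} \rho_m^r}{\sum_{m'=1}^{M} \alpha_{nm'} \rho_{m'}^r} \log\left( \frac{\rho_m}{\rho_m^r} \right) \right) + \log\left( \sum_{m=1}^{M} \alpha_{nm} \rho_m^r \right) \right)
\quad \text{s.t.} \quad \sum_{m=1}^{M} \rho_m = 1, \;\; \rho_m \ge 0, \;\; \forall m = 1, \ldots, M. \tag{37}
$$

Using Jensen’s inequality, it is not hard to check that (37) is a valid upper-bound of (36) in the BSUM framework. Moreover, (37) has a closed form solution obtained by

$$
\rho_m^{r+1} = \frac{1}{N} \sum_{n=1}^{N} \frac{\alpha_{nm} \rho_m^r}{\sum_{m'=1}^{M} \alpha_{nm'} \rho_{m'}^r}, \quad \forall m = 1, \ldots, M,
$$

which makes the algorithm computationally efficient at each step.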
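The closed-form update translates directly into a few lines of array code. The sketch below assumes that the matrix `alpha` (N reads by M sequences) is already available, that every read has at least one positive entry, and that the iteration starts from the uniform distribution. The function names and the starting point are hypothetical and are not taken from the eXpress software.

```python
import numpy as np

def neg_log_likelihood(alpha, rho):
    """Objective of (36): -sum_n log(sum_m alpha[n, m] * rho[m])."""
    return -np.sum(np.log(alpha @ rho))

def em_abundance(alpha, n_iter=100):
    """Repeated closed-form minimization of the upper bound (37)."""
    N, M = alpha.shape
    rho = np.full(M, 1.0 / M)              # assumed uniform starting point
    history = [neg_log_likelihood(alpha, rho)]
    for _ in range(n_iter):
        weighted = alpha * rho              # alpha[n, m] * rho[m] at iterate r
        # Row-normalize: fraction of read n attributed to sequence m.
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        rho = resp.mean(axis=0)             # (1/N) * sum over reads
        history.append(neg_log_likelihood(alpha, rho))
    return rho, history
```

Each row of `resp` sums to one, so the new `rho` stays on the simplex without any explicit projection, and the values in `history` are non-increasing because each step minimizes a locally tight upper bound of (36).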

For another application of the BSUM algorithm in classical genetics, the readers are referred to the traditional gene counting algorithm [88].

2) Tensor decomposition: The CANDECOMP/PARAFAC (CP) decomposition has applications in different areas such as chemometrics [89], [90], clustering [91], and compression [92]. For the ease

March 10, 2015 DRAFT
