
  • Large Scale Optimization for Machine Learning

    Meisam Razaviyayn Lecture 21

    razaviya@usc.edu

  • Announcements:

    • Midterm exams will be returned next Tuesday

    1

  • Non-smooth Objective Function • Sub-gradient

    • Typically slow, with no good termination criterion (other than cross-validation)

    • Proximal Gradient • Fast assuming each iteration is easy

    • Block Coordinate Descent • Also helpful for exploiting multi-block structure

    • Alternating Direction Method of Multipliers (ADMM) • Will be covered later

    1

  • Multi-Block Structure and BCD Method

    Block Coordinate Descent (BCD) Method:

    Simple and scalable: Lasso example

    At iteration r, choose an index i and minimize the objective over block x_i while keeping all other blocks fixed:

    x_i^{r+1} ∈ argmin_{x_i} f(x_1^r, …, x_{i-1}^r, x_i, x_{i+1}^r, …, x_n^r)

    Choice of index i: Cyclic, randomized, Greedy

    2

    Very different from the previously covered incremental GD, SGD, …

    Geometric Interpretation
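
    To make the Lasso example on this slide concrete, here is a minimal sketch of cyclic BCD for the Lasso, assuming the standard formulation min_x 0.5*||Ax - b||^2 + lam*||x||_1 with each coordinate treated as one block; the per-block minimizer is the soft-thresholding operator. Function and variable names are illustrative choices, not part of the lecture.

    import numpy as np

    def soft_threshold(z, t):
        # Proximal operator of t * |.|: shrink toward zero by t.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_cyclic_bcd(A, b, lam, num_epochs=50):
        # Cyclic BCD for min_x 0.5*||Ax - b||^2 + lam*||x||_1,
        # treating each coordinate as one block (exact per-block minimization).
        m, n = A.shape
        x = np.zeros(n)
        residual = b - A @ x                 # maintained incrementally
        col_sq = (A ** 2).sum(axis=0)        # ||A_i||^2 for every column
        for _ in range(num_epochs):
            for i in range(n):               # cyclic choice of the index i
                if col_sq[i] == 0.0:
                    continue
                # exact minimizer over x_i with all other coordinates fixed
                rho = A[:, i] @ residual + col_sq[i] * x[i]
                x_new = soft_threshold(rho, lam) / col_sq[i]
                residual += A[:, i] * (x[i] - x_new)
                x[i] = x_new
        return x

    Maintaining the residual incrementally keeps each coordinate update at O(m) cost instead of recomputing Ax from scratch, which is what makes the method simple and scalable.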

  • Convergence of BCD Method

    3

    Assumptions: • Separable constraints

    • Differentiable/smooth objective

    • Unique minimizer at each step

    Then every limit point of the BCD iterates is a stationary point.

    Which of these assumptions are necessary? Proof?

    See “Nonlinear programming”, D. P. Bertsekas, for the cyclic update rule.

  • Necessity of Smoothness Assumption

    4

    Not “regular”: if the non-smooth part of the objective couples the blocks, BCD can get stuck at a point that is not stationary.

    Example: the Lasso objective is regular, since it is a smooth loss plus a separable (coordinate-wise) non-smooth term.

  • BCD and Non-smooth Objective

    5

    Theorem [Tseng 2001]: Assume 1) the feasible set is compact, 2) the minimizer at each step is unique, 3) the constraints are separable, and 4) the objective function is regular. Then every limit point of the iterates is a stationary point.

    Rate of convergence of BCD: • Similar to GD: sublinear for general convex and linear for strongly convex objectives

    • The same rates can be shown for most of the popular non-smooth objectives

    What is the definition of stationarity for a non-smooth objective? The result holds for the cyclic, randomized, and greedy update rules.

  • Uniqueness of the Minimizer

    6

    [Michael J. D. Powell 1973]: an example where cyclic coordinate descent cycles indefinitely and fails to converge to a stationary point.

  • Uniqueness of the Minimizer

    7

    Tensor PARAFAC Decomposition

    NP-hard [Hastad 1990]

    [Carroll 1970], [Harshman 1970]: Alternating Least Squares

    [Figure: reconstruction error versus iterates for alternating least squares, illustrating the “swamp” effect of long plateaus with little progress.]
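
    As a concrete illustration of this slide, below is a minimal sketch of alternating least squares for a 3-way CP/PARAFAC model: each factor matrix is one block, and each block update is an exact linear least-squares solve. The khatri_rao/unfold helpers, the function names, and the unfolding convention are my own illustrative choices.

    import numpy as np

    def khatri_rao(U, V):
        # Column-wise Kronecker product of U (I x R) and V (J x R) -> (I*J x R).
        I, R = U.shape
        J = V.shape[0]
        return (U[:, None, :] * V[None, :, :]).reshape(I * J, R)

    def unfold(T, mode):
        # Mode-n unfolding, consistent with the khatri_rao ordering above.
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def cp_als(T, rank, num_iters=200, seed=0):
        # Alternating least squares for a rank-`rank` CP/PARAFAC model of a
        # 3-way tensor: each factor is one block, updated by an exact LS solve.
        rng = np.random.default_rng(seed)
        I, J, K = T.shape
        A = rng.standard_normal((I, rank))
        B = rng.standard_normal((J, rank))
        C = rng.standard_normal((K, rank))
        for _ in range(num_iters):
            A = np.linalg.lstsq(khatri_rao(B, C), unfold(T, 0).T, rcond=None)[0].T
            B = np.linalg.lstsq(khatri_rao(A, C), unfold(T, 1).T, rcond=None)[0].T
            C = np.linalg.lstsq(khatri_rao(A, B), unfold(T, 2).T, rcond=None)[0].T
        return A, B, C

    When the factor columns become nearly collinear, the per-block least-squares problems are ill-conditioned and the block minimizer is nearly non-unique, which is the regime commonly associated with the swamp behavior.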


  • BCD Limitations

    8

    • Uniqueness of minimizer

    • Each sub-problem needs to be easily solvable

    Popular Solution: Inexact BCD

    At iteration r, choose an index i and minimize a local approximation of the objective function over block i, keeping all other blocks fixed.

    Related methods and names: block successive upper-bound minimization (BSUM), block successive convex approximation, convex-concave procedure, majorization-minimization, DC programming, BCGD, …


  • Idea of Block Successive Upper-bound Minimization

    9

    Locally tight: the surrogate coincides with the objective at the current iterate.

    Global upper-bound: the surrogate upper-bounds the objective for every feasible value of the chosen block.

    Monotone Algorithm

    Every limit point is a stationary point
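
    A compact way to state the scheme is the loop below (a sketch only: surrogate_argmin is a hypothetical callback standing in for whatever locally tight global upper bound is chosen per block).

    import random

    def bsum(x0, surrogate_argmin, num_iters, rule="cyclic"):
        # Generic BSUM loop (sketch). x0 is a list of blocks; surrogate_argmin(i, x)
        # must return the minimizer over block i of a surrogate u_i(.; x) that is a
        # global upper bound on the objective and is tight at the current point x.
        x = list(x0)
        n = len(x)
        for r in range(num_iters):
            i = r % n if rule == "cyclic" else random.randrange(n)
            # since u_i upper-bounds f and matches it at x, this step cannot increase f
            x[i] = surrogate_argmin(i, x)
        return x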


  • Example 1: Block Coordinate (Proximal) Gradient Descent

    10

    Smooth Scenario: instead of exactly minimizing over block i, take a single gradient step on that block.

    Non-smooth Scenario: take a proximal gradient step on block i (a gradient step on the smooth part, followed by the proximal operator of the block's non-smooth part).

    Alternating Proximal Minimization: minimize the objective over block i plus a proximal term that penalizes distance from the previous iterate x_i^r; the quadratic proximal term can also be replaced by a Bregman divergence.
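
    For the non-smooth scenario, here is a minimal sketch, again using the Lasso as the running example (my own choice; the block partition, step-size rule, and names are illustrative): for each chosen block, take one gradient step on the smooth part and then apply that block's prox, instead of solving the block subproblem exactly.

    import numpy as np

    def soft_threshold(z, t):
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def lasso_block_prox_grad(A, b, lam, block_size=10, num_epochs=50):
        # Block coordinate proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1:
        # per block, one gradient step on the smooth part, then the block's prox
        # (soft-thresholding), rather than exact block minimization.
        m, n = A.shape
        x = np.zeros(n)
        blocks = [np.arange(s, min(s + block_size, n)) for s in range(0, n, block_size)]
        # per-block step sizes 1 / L_i, with L_i the block Lipschitz constant ||A_i||_2^2
        steps = [1.0 / max(np.linalg.norm(A[:, idx], 2) ** 2, 1e-12) for idx in blocks]
        for _ in range(num_epochs):
            for idx, step in zip(blocks, steps):          # cyclic rule over blocks
                grad_i = A[:, idx].T @ (A @ x - b)        # gradient of the smooth part
                x[idx] = soft_threshold(x[idx] - step * grad_i, step * lam)
        return x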


  • Example 2: Expectation Maximization Algorithm

    11

    Jensen’s inequality
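
    As a minimal illustration of EM as successive upper-bound minimization (a sketch under my own assumptions, not the example worked on the slide: the model is a one-dimensional Gaussian mixture and all names are illustrative): the E-step computes responsibilities, which define the Jensen-based locally tight surrogate of the negative log-likelihood, and the M-step minimizes that surrogate in closed form.

    import numpy as np

    def gmm_em_1d(x, K=2, num_iters=100, seed=0):
        # EM for a 1-D Gaussian mixture, viewed as successive upper-bound minimization.
        rng = np.random.default_rng(seed)
        n = x.size
        weights = np.full(K, 1.0 / K)                  # mixture weights
        mu = rng.choice(x, size=K, replace=False)      # initial means (data points)
        var = np.full(K, x.var() + 1e-12)              # initial variances
        for _ in range(num_iters):
            # E-step: responsibilities r[n, k] proportional to weights_k * N(x_n; mu_k, var_k)
            log_r = (np.log(weights) - 0.5 * np.log(2.0 * np.pi * var)
                     - 0.5 * (x[:, None] - mu) ** 2 / var)
            log_r -= log_r.max(axis=1, keepdims=True)  # for numerical stability
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: closed-form minimizer of the surrogate
            nk = r.sum(axis=0)
            weights = nk / n
            mu = (r * x[:, None]).sum(axis=0) / nk
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-12
        return weights, mu, var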

  • Example 3: Transcript Abundance Estimation

    12

    The likelihood of the reads R_1, …, R_N given the abundance levels ρ_1, …, ρ_M can be written as

    $$\Pr(R_1,\ldots,R_N;\,\rho_1,\ldots,\rho_M) = \prod_{n=1}^{N} \Pr(R_n;\,\rho_1,\ldots,\rho_M) = \prod_{n=1}^{N}\left(\sum_{m=1}^{M}\Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)\,\Pr(s_m)\right) = \prod_{n=1}^{N}\left(\sum_{m=1}^{M}\alpha_{nm}\rho_m\right),$$

    where $\alpha_{nm} \triangleq \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m)$ can be obtained efficiently using an alignment algorithm such as the ones based on the Burrows-Wheeler transform; see, e.g., [85], [86]. Therefore, given $\{\alpha_{nm}\}_{n,m}$, the maximum likelihood estimation of the abundance levels can be stated as

    $$\widehat{\rho}_{\mathrm{ML}} = \operatorname*{argmin}_{\rho}\; -\sum_{n=1}^{N}\log\left(\sum_{m=1}^{M}\alpha_{nm}\rho_m\right) \quad \text{s.t.}\quad \sum_{m=1}^{M}\rho_m = 1,\;\; \rho_m \ge 0,\;\; \forall\, m = 1,\ldots,M. \tag{36}$$

    As a special case of the EM algorithm, a popular approach for solving this optimization problem is to successively minimize a locally tight upper bound of the objective function. In particular, the eXpress software [87] solves the following optimization problem at the r-th iteration of the algorithm:

    $$\rho^{r+1} = \operatorname*{argmin}_{\rho}\; -\sum_{n=1}^{N}\left(\sum_{m=1}^{M}\frac{\alpha_{nm}\rho_m^{r}}{\sum_{m'=1}^{M}\alpha_{nm'}\rho_{m'}^{r}}\,\log\!\left(\frac{\rho_m}{\rho_m^{r}}\right) + \log\left(\sum_{m=1}^{M}\alpha_{nm}\rho_m^{r}\right)\right) \quad \text{s.t.}\quad \sum_{m=1}^{M}\rho_m = 1,\;\; \rho_m \ge 0,\;\; \forall\, m = 1,\ldots,M. \tag{37}$$

    Using Jensen's inequality, it is not hard to check that (37) is a valid upper bound of (36) in the BSUM framework. Moreover, (37) has a closed-form solution given by

    $$\rho_m^{r+1} = \frac{1}{N}\sum_{n=1}^{N}\frac{\alpha_{nm}\rho_m^{r}}{\sum_{m'=1}^{M}\alpha_{nm'}\rho_{m'}^{r}}, \quad \forall\, m = 1,\ldots,M,$$

    which makes the algorithm computationally efficient at each step.
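
    The closed-form update above translates directly into a few lines of code. Here is a minimal sketch (function and variable names are illustrative), where each iteration costs one pass over the N-by-M matrix of alignment probabilities α:

    import numpy as np

    def em_abundance(alpha, num_iters=100):
        # EM / BSUM updates for the ML transcript-abundance problem (36):
        # alpha[n, m] = Pr(R_n | read R_n came from sequence s_m), shape (N, M).
        N, M = alpha.shape
        rho = np.full(M, 1.0 / M)                # start at the center of the simplex
        for _ in range(num_iters):
            denom = alpha @ rho                  # denom[n] = sum_m' alpha[n, m'] * rho[m']
            resp = alpha * rho / denom[:, None]  # responsibilities; each row sums to 1
            rho = resp.sum(axis=0) / N           # closed-form minimizer of surrogate (37)
        return rho

    Because each row of the responsibility matrix sums to 1, the updated rho automatically stays on the simplex, so the constraints in (36) never need to be enforced explicitly.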

    For another application of the BSUM algorithm in classical genetics, the readers are referred to the

    traditional gene counting algorithm [88].

