Announcements:
• Midterm exams
• Return it next Tuesday
Non-smooth Objective Functions
• Sub-gradient: typically slow, and no good termination criteria (other than cross-validation)
• Proximal gradient: fast, assuming each iteration is easy
• Block coordinate descent: also helpful for exploiting multi-block structure
• Alternating Direction Method of Multipliers (ADMM): will be covered later
Multi-Block Structure and BCD Method
Block Coordinate Descent (BCD) Method: simple and scalable (Lasso example).
At iteration r, choose an index i and update that block while keeping all other blocks fixed:
  x_i^{r+1} = \arg\min_{x_i \in X_i} f(x_1^r, \ldots, x_{i-1}^r, x_i, x_{i+1}^r, \ldots, x_K^r)
Choice of index i: cyclic, randomized, or greedy.
Very different from the previous incremental GD, SGD, ...
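As an illustration (not from the slides), here is a minimal NumPy sketch of cyclic BCD for the Lasso problem min_x 0.5||Ax − b||^2 + λ||x||_1, where each single-coordinate sub-problem has a closed-form soft-thresholding solution; the function names `lasso_bcd` and `soft_threshold` are our own choices.

```python
import numpy as np

def soft_threshold(z, t):
    # closed-form minimizer of 0.5*(x - z)^2 + t*|x|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_bcd(A, b, lam, n_iters=100):
    """Cyclic coordinate descent for min 0.5*||Ax - b||^2 + lam*||x||_1."""
    m, n = A.shape
    x = np.zeros(n)
    col_norms = (A ** 2).sum(axis=0)
    r = b - A @ x  # residual, maintained incrementally for efficiency
    for _ in range(n_iters):
        for i in range(n):
            # remove coordinate i's contribution from the residual
            r += A[:, i] * x[i]
            rho = A[:, i] @ r
            # exact minimizer of the one-dimensional sub-problem
            x[i] = soft_threshold(rho, lam) / col_norms[i]
            r -= A[:, i] * x[i]
    return x
```

With A = I the method recovers the soft-thresholded data in a single cyclic pass, which is a quick sanity check on the update.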
Geometric Interpretation
Convergence of BCD Method
Assumptions (for the cyclic update rule; see "Nonlinear Programming," D. P. Bertsekas):
• Separable constraints
• Differentiable/smooth objective
• Unique minimizer at each step
Are these assumptions necessary? Proof?
Necessity of the Smoothness Assumption
When the objective is non-smooth and not "regular," BCD can get stuck at a non-stationary point: every coordinate direction may look optimal even though a joint move would still decrease the objective.
Example: the Lasso objective is a smooth loss plus a separable non-smooth term, so it avoids this failure mode.
BCD and Non-smooth Objectives
Theorem [Tseng 2001]. Assume:
1) the feasible set is compact;
2) the minimizer at each step is unique;
3) the constraints are separable;
4) the objective function is regular.
Then every limit point of the iterates is a stationary point. (What is the right definition of stationarity for non-smooth functions? The result holds for the cyclic/randomized/greedy rules.)
Rate of convergence of BCD:
• Similar to GD: sublinear for general convex problems, linear for strongly convex ones
• The same results can be shown for most popular non-smooth objectives
Uniqueness of the Minimizer
[Michael J. D. Powell, 1973]: Powell constructed a smooth example for which cyclic BCD with exact per-block minimization cycles indefinitely and fails to converge to a stationary point.
Tensor PARAFAC Decomposition
Computing the PARAFAC (CP) decomposition is NP-hard [Håstad 1990].
[Carroll 1970], [Harshman 1970]: Alternating Least Squares (ALS).
[Figure: error vs. iterates (0 to 300) for ALS; the error plateaus for long stretches before dropping — the "swamp" effect.]
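A minimal NumPy sketch of ALS for a third-order CP decomposition (the function names `cp_als` and `khatri_rao` are our own; `pinv` guards against ill-conditioned Gram matrices). Each factor update is a linear least-squares problem, i.e., one exact BCD step on that factor block:

```python
import numpy as np

def khatri_rao(B, C):
    # column-wise Kronecker product: (J*K) x R, row index j*K + k
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, B.shape[1])

def cp_als(T, rank, n_iters=500, seed=0):
    """Alternating least squares for a rank-`rank` CP decomposition
    of a 3-way tensor T[i, j, k] ~ sum_r A[i,r] B[j,r] C[k,r]."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    T1 = T.reshape(I, -1)                     # mode-1 unfolding
    T2 = np.moveaxis(T, 1, 0).reshape(J, -1)  # mode-2 unfolding
    T3 = np.moveaxis(T, 2, 0).reshape(K, -1)  # mode-3 unfolding
    for _ in range(n_iters):
        # each line is an exact least-squares solve for one factor
        A = T1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C
```

On a random exact low-rank tensor this usually fits to high accuracy quickly, but — as the figure above illustrates — ALS can also stall in long "swamps" before making progress.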
BCD Limitations
• Uniqueness of the minimizer
• Each sub-problem needs to be easily solvable
Popular solution: inexact BCD. At iteration r, choose an index i and minimize a local approximation of the objective function rather than the objective itself.
Related frameworks: block successive upper-bound minimization, block successive convex approximation, convex-concave procedure, majorization-minimization, DC programming, BCGD, ...
Idea of Block Successive Upper-bound Minimization
At each step, minimize a surrogate u_i(x_i; x^r) of f over block i that is:
• A global upper bound: u_i(x_i; x^r) >= f(x_1^r, \ldots, x_i, \ldots, x_K^r) for all feasible x_i
• Locally tight: u_i(x_i^r; x^r) = f(x^r)
This gives a monotone algorithm, and every limit point is a stationary point.
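The upper-bound idea can be sketched in the single-block case: for an L-smooth f, the quadratic u(x; x^r) = f(x^r) + ⟨∇f(x^r), x − x^r⟩ + (L/2)||x − x^r||^2 is a global, locally tight upper bound, and minimizing it gives a gradient step of size 1/L. A toy quadratic (our own choice of Q and b, not from the slides) verifies the monotone decrease:

```python
import numpy as np

# f(x) = 0.5 x^T Q x - b^T x, a convex quadratic with gradient Q x - b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
L = np.linalg.eigvalsh(Q).max()  # smoothness constant of f

x = np.zeros(2)
values = [f(x)]
for _ in range(100):
    # minimizing u(.; x) -- tight at x, a global bound since L >= lambda_max(Q) --
    # is exactly a gradient step with step size 1/L
    x = x - grad(x) / L
    values.append(f(x))

# the surrogate property forces f(x^{r+1}) <= u(x^{r+1}; x^r) <= u(x^r; x^r) = f(x^r)
assert all(values[k + 1] <= values[k] + 1e-12 for k in range(len(values) - 1))
```

The iterates also converge to the unique minimizer Q^{-1} b, consistent with every limit point being stationary.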
Example 1: Block Coordinate (Proximal) Gradient Descent
Smooth scenario: the surrogate is the quadratic upper bound
  u_i(x_i; x^r) = f(x^r) + \langle \nabla_i f(x^r), x_i - x_i^r \rangle + (L_i/2) \|x_i - x_i^r\|^2,
so minimizing it is a gradient step on block i.
Non-smooth scenario: keep the non-smooth term inside the surrogate, so minimizing it is a proximal gradient step on block i.
Alternating proximal minimization: minimize f itself plus a proximal term (1/2\gamma)\|x_i - x_i^r\|^2, possibly using a Bregman divergence in place of the quadratic.
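A minimal sketch of the non-smooth scenario, assuming a Lasso-type objective and user-chosen blocks (the function name `bcpg_lasso` is our own): each block update is a single proximal gradient step using that block's Lipschitz constant, rather than an exact block minimization.

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def bcpg_lasso(A, b, lam, blocks, n_iters=200):
    """Block coordinate proximal gradient for 0.5*||Ax - b||^2 + lam*||x||_1:
    one prox-gradient step per block, cycled over the blocks."""
    n = A.shape[1]
    x = np.zeros(n)
    # per-block Lipschitz constants of the smooth part 0.5*||Ax - b||^2
    Ls = [np.linalg.norm(A[:, blk], 2) ** 2 for blk in blocks]
    for _ in range(n_iters):
        for blk, L in zip(blocks, Ls):
            g = A[:, blk].T @ (A @ x - b)          # gradient on this block
            x[blk] = soft(x[blk] - g / L, lam / L)  # prox step on the l1 term
    return x
```

Each sub-problem is now trivially solvable (a soft-threshold), which is exactly the point of inexact BCD.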
Example 2: Expectation Maximization Algorithm
The E-step uses Jensen's inequality to build a locally tight upper bound on the negative log-likelihood; the M-step minimizes that bound. EM is therefore a special case of successive upper-bound minimization.
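EM's E/M steps as an upper-bound-minimization iteration can be sketched on a toy two-component, unit-variance 1-D Gaussian mixture (the model choice and the name `em_gmm_1d` are our own, not from the slides):

```python
import numpy as np

def em_gmm_1d(x, n_iters=100, seed=0):
    """EM for a two-component 1-D Gaussian mixture with unit variances.
    E-step: Jensen's inequality gives a locally tight upper bound on the
    negative log-likelihood; M-step: minimize that bound in closed form."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(2)      # component means
    pi = np.array([0.5, 0.5])        # mixing weights
    for _ in range(n_iters):
        # E-step: posterior responsibilities (the tight point of the bound)
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form minimizer of the upper bound
        Nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        pi = Nk / len(x)
    return mu, pi
```

By the BSUM argument above, each iteration monotonically increases the likelihood.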
Example 3: Transcript Abundance Estimation
[Excerpt from a March 10, 2015 draft, p. 24:]
The likelihood of the reads R_1, \ldots, R_N given the abundance levels \rho_1, \ldots, \rho_M can be written as

  \Pr(R_1, \ldots, R_N; \rho_1, \ldots, \rho_M)
    = \prod_{n=1}^{N} \Pr(R_n; \rho_1, \ldots, \rho_M)
    = \prod_{n=1}^{N} \Big( \sum_{m=1}^{M} \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m) \Pr(s_m) \Big)
    = \prod_{n=1}^{N} \Big( \sum_{m=1}^{M} \alpha_{nm} \rho_m \Big),

where \alpha_{nm} \triangleq \Pr(R_n \mid \text{read } R_n \text{ from sequence } s_m) can be obtained efficiently using an alignment algorithm such as the ones based on the Burrows-Wheeler transform; see, e.g., [85], [86]. Therefore, given \{\alpha_{nm}\}_{n,m}, the maximum likelihood estimation of the abundance levels can be stated as

  \hat{\rho}_{\mathrm{ML}} = \arg\min_{\rho} \; -\sum_{n=1}^{N} \log \Big( \sum_{m=1}^{M} \alpha_{nm} \rho_m \Big)
  \quad \text{s.t.} \; \sum_{m=1}^{M} \rho_m = 1, \ \text{and} \ \rho_m \ge 0, \ \forall m = 1, \ldots, M.    (36)

As a special case of the EM algorithm, a popular approach for solving this optimization problem is to successively minimize a locally tight upper bound of the objective function. In particular, the eXpress software [87] solves the following optimization problem at the r-th iteration of the algorithm:

  \rho^{r+1} = \arg\min_{\rho} \; -\sum_{n=1}^{N} \Bigg( \sum_{m=1}^{M} \frac{\alpha_{nm} \rho_m^r}{\sum_{m'=1}^{M} \alpha_{nm'} \rho_{m'}^r} \log \Big( \frac{\rho_m}{\rho_m^r} \Big) + \log \Big( \sum_{m=1}^{M} \alpha_{nm} \rho_m^r \Big) \Bigg)
  \quad \text{s.t.} \; \sum_{m=1}^{M} \rho_m = 1, \ \text{and} \ \rho_m \ge 0, \ \forall m = 1, \ldots, M.    (37)

Using Jensen's inequality, it is not hard to check that (37) is a valid upper bound of (36) in the BSUM framework. Moreover, (37) has a closed-form solution given by

  \rho_m^{r+1} = \frac{1}{N} \sum_{n=1}^{N} \frac{\alpha_{nm} \rho_m^r}{\sum_{m'=1}^{M} \alpha_{nm'} \rho_{m'}^r}, \quad \forall m = 1, \ldots, M,

which makes the algorithm computationally efficient at each step.
For another application of the BSUM algorithm in classical genetics, the readers are referred to the traditional gene counting algorithm [88].
2) Tensor decomposition: The CANDECOMP/PARAFAC (CP) decomposition has applications in different areas such as chemometrics [89], [90], clustering [91], and compression [92].
Closed-form update!
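The closed-form update above translates directly into a few lines of NumPy (the function name `abundance_em` is our own; in practice `alpha` would come from a read aligner):

```python
import numpy as np

def abundance_em(alpha, n_iters=200):
    """EM / BSUM iteration (37) for transcript abundance estimation.
    alpha[n, m] = Pr(read R_n | sequence s_m); rows index reads."""
    N, M = alpha.shape
    rho = np.full(M, 1.0 / M)             # uniform start on the simplex
    for _ in range(n_iters):
        w = alpha * rho                    # unnormalized posteriors
        w /= w.sum(axis=1, keepdims=True)  # Pr(s_m | R_n) under current rho
        rho = w.sum(axis=0) / N            # closed-form minimizer of (37)
    return rho
```

Each iteration stays on the probability simplex by construction, and the per-iteration cost is a single O(NM) pass over the alignment weights.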