
Page 1:

Convex and non-convex worlds in machine learning

Anna Choromanska

Courant Institute of Mathematical Sciences, New York University

Page 2:

Convex and non-convex worlds

Machine learning and optimization: many machine learning problems are formulated as minimization of some loss function on a training set of examples. The loss function expresses the discrepancy between the predictions of the model being trained and the actual problem instances. Optimization algorithms can then minimize this loss. (Wikipedia)

Convex world

local min = global min; strictly convex: unique min

efficient solvers; strong theoretical guarantees

Non-convex world

multiple local min ≠ global min; many solvers come from the convex world

weak theoretical guarantees, if any


Page 5:

Layout of the talk

Optimization solvers: generic optimization vs bound majorization; partition function-based objectives

Design (convex) solver: quadratic bound majorization [JC12, CKJ12, ACJK14]

Challenging problems: multi-class classification. Design objective: statistical and computational constraints

Online multi-class partition trees for logarithmic time predictions [CL14, CAL13, CCB15, CCJM15, CCJM13, CJKMM13, BCCL15, CM12]

Non-convex problems: deep learning; highly non-convex objective

Build understanding: new theoretical results [CHMCL15, CLB15, ZCL15]


Page 8:

How to build a good, efficient convex solver?

Page 9:

Optimization solvers

Generic optimization techniques and majorization methods

Batch: steepest descent, conjugate gradient, Newton, (L)BFGS [B70]

Stochastic: SGD [RB51], ASGD [PJ92], SAG [LRSB12], SDCA [SSZ13], SVRG [JZ13]

Semi-stochastic: hybrid deterministic-stochastic methods [FS12]

Majorization methods: MISO [M13], iterative scaling [DR72], EM [DLR77], quadratic lower bound principle [BL88]

. . .

Page 10:

Majorization

Bound majorization

If the cost function's minimizer θ* = argmin_θ C(θ) has no closed-form solution, majorization uses a surrogate Q with a closed-form solution and monotonically improves from an initial θ0:

Find a bound Q(θ, θ_i) ≥ C(θ) with Q(θ_i, θ_i) = C(θ_i)

Update θ_{i+1} = argmin_θ Q(θ, θ_i)

Repeat until converged
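As a concrete illustration, here is a minimal Python sketch of this loop with a 1-D logistic-style example; the function names and the example objective are illustrative, not from the talk:

```python
import numpy as np

def majorize_minimize(C, surrogate_min, theta0, max_iter=100, tol=1e-10):
    # Generic bound-majorization loop: surrogate_min(theta_i) returns
    # argmin_theta Q(theta, theta_i), where Q upper-bounds C and touches it
    # at theta_i, so the sequence C(theta_0), C(theta_1), ... is monotone.
    theta = theta0
    for _ in range(max_iter):
        theta_next = surrogate_min(theta)
        if abs(C(theta) - C(theta_next)) < tol:
            return theta_next
        theta = theta_next
    return theta

# Illustrative 1-D objective: C(t) = log(1 + e^t) - 0.3 t has C''(t) <= 1/4,
# so Q(t, ti) = C(ti) + C'(ti)(t - ti) + (t - ti)^2 / 8 majorizes C, with the
# closed-form minimizer ti - 4 C'(ti).
C = lambda t: np.log1p(np.exp(t)) - 0.3 * t
C_prime = lambda t: 1.0 / (1.0 + np.exp(-t)) - 0.3
print(majorize_minimize(C, lambda ti: ti - 4.0 * C_prime(ti), theta0=5.0))
# converges near log(0.3/0.7), about -0.847
```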


Page 13:

Majorization

Generic optimization techniques vs majorization methods

Majorization methods were preferred until [W03, AG07].

...Why? They were slower than other optimizers, because of loose and complicated bounds.

Let's fix this!

Page 14:

Partition Bound

Partition Function

Log-linear model partition functions

Z(θ) = Σ_y h(y) exp(θᵀ f(y))

The partition function ensures that p(y|θ) normalizes.

It is a central quantity to optimize in

maximum likelihood and e-family [P36]

maximum entropy [J57]

conditional random fields [LMP01]

log-linear models [DR72]

graphical models, HMMs [JGJS99].

Problem: it's ugly to minimize; we much prefer quadratics.

Page 15:

Partition Bound

Partition Function Bound

The bound

ln Z(θ) ≤ ln z + ½ (θ − θ̃)ᵀ Σ (θ − θ̃) + (θ − θ̃)ᵀ µ

is tight at θ̃ and holds for parameters given by

Input: θ̃, f(y), h(y) ∀ y ∈ Ω
Init: z → 0+, µ = 0, Σ = zI
For each y ∈ Ω:
  α = h(y) exp(θ̃ᵀ f(y))
  r = f(y) − µ
  Σ += [tanh(½ ln(α/z)) / (2 ln(α/z))] r rᵀ
  µ += [α / (z + α)] r
  z += α
Output: z, µ, Σ

[Figure: ln Z(θ) and its quadratic bounds as functions of θ.]

Computing the bound takes O(nd²); the update θ̃ ← θ̃ − Σ⁻¹µ takes O(d³).
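A NumPy sketch of the bound computation above; the variable names are mine, and the log-space handling of α/z plus the c → 0 limit (tanh(c/2)/(2c) → 1/4) are added for numerical safety:

```python
import numpy as np

def partition_bound(theta, F, h):
    # Quadratic upper bound on ln Z(t) = ln sum_y h(y) exp(t^T f(y)),
    # tight at t = theta; F stores one feature vector f(y) per row.
    d = F.shape[1]
    z = 1e-300                       # z -> 0+ initialization
    mu = np.zeros(d)
    Sigma = z * np.eye(d)
    for f_y, h_y in zip(F, h):
        alpha = h_y * np.exp(theta @ f_y)
        r = f_y - mu
        c = np.log(alpha) - np.log(z)
        coef = 0.25 if abs(c) < 1e-12 else np.tanh(0.5 * c) / (2.0 * c)
        Sigma += coef * np.outer(r, r)
        mu += alpha / (z + alpha) * r
        z += alpha
    return z, mu, Sigma

# Sanity check: z accumulates Z(theta), so ln z equals ln Z at theta.
F, h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]), np.ones(3)
theta = np.array([0.3, -0.2])
z, mu, Sigma = partition_bound(theta, F, h)
print(np.log(z), np.log(np.sum(h * np.exp(F @ theta))))
```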


Page 17:

Bound applications

Conditional Random Fields (CRFs)

Trained on i.i.d. data (x1, y1), ..., (xt, yt). Each CRF is a log-linear model

p(y | x_j, θ) = (1 / Z_{x_j}(θ)) h_{x_j}(y) exp(θᵀ f_{x_j}(y))

The regularized maximum likelihood objective is

J(θ) = Σ_{j=1}^t [ log( h_{x_j}(y_j) / Z_{x_j}(θ) ) + θᵀ f_{x_j}(y_j) ] − (λ/2) ‖θ‖²

Page 18:

Bound applications

Maximum Likelihood Algorithm for CRFs

While not converged:
  For j = 1, ..., t: compute bound parameters µ_j, Σ_j from h_{x_j}, f_{x_j}, θ̃
  Set θ̃ = argmin_{θ∈Λ} Σ_j ½ (θ − θ̃)ᵀ (Σ_j + λI) (θ − θ̃) + Σ_j θᵀ (µ_j − f_{x_j}(y_j) + λθ̃)

Theorem

The algorithm outputs a θ̂ such that

J(θ̂) − J(θ0) ≥ (1 − ε) max_{θ∈Λ} (J(θ) − J(θ0))

within ⌈ log(1/ε) / log(1 + λ log n / (2r²n)) ⌉ steps.
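A sketch of one outer iteration for the simplest case, multiclass logistic regression viewed as a structureless CRF; it reuses partition_bound from the previous page, and the unconstrained closed-form update in place of the argmin over Λ is my simplification:

```python
import numpy as np

def bound_majorization_step(theta, examples, lam):
    # One outer iteration: bound each per-example partition function at the
    # current theta, then minimize the summed quadratic surrogate in closed
    # form. examples: list of (F_j, y_j), with F_j holding f_{x_j}(y) row-wise.
    d = theta.shape[0]
    A = np.zeros((d, d))   # accumulates sum_j (Sigma_j + lam I)
    b = np.zeros(d)        # accumulates sum_j (mu_j - f_{x_j}(y_j) + lam theta)
    for F_j, y_j in examples:
        _, mu_j, Sigma_j = partition_bound(theta, F_j, np.ones(len(F_j)))
        A += Sigma_j + lam * np.eye(d)
        b += mu_j - F_j[y_j] + lam * theta
    return theta - np.linalg.solve(A, b)
```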


Page 20:

Experiments

Experiments - Markov CRFs

Bound admits a low-rank version (O(tnd)): as in LBFGS, use rank-k storage Σ = VSVᵀ + D and absorb the residual into the diagonal D ⇒ the low-rank version is still a bound.

Graphical models, e.g. Markov CRFs: build a junction tree and run a Collect algorithm; this only needs O(td² Σ_c |Y_c|) rather than O(td²n).

CONLL dataset:

Algorithm   time    passes
L-BFGS      1.00t   17
CG          3.47t   23
Bound       0.64t   4

Second benchmark (setup of [W03]):

Algorithm   time      passes
L-BFGS      1.00t     22
CG          5.94t     27
IIS         ≥ 6.35t   ≥ 150
Bound       …         …

Latent models: the objective function is non-concave (a ratio of partition functions); apply Jensen to the numerator and our bound to the denominator; often a better solution than BFGS, Newton, CG, SD, ...


Page 24:

Experiments

Experiments - Latent models

Bounding also simplifies mixture models with hidden variables (mixtures of Gaussians, HMMs, latent graphical models). Assume exponential-family mixture components (Gaussian, multinomial, Poisson, Laplace). Latent CRF or log-linear model [Quattoni et al. '07]:

L(θ) = ∏_{j=1}^t [ Σ_m exp(θᵀ f_{j,y_j,m}) / Σ_{y,m} exp(θᵀ f_{j,y,m}) ] ≥ Q(θ, θ̃)

Apply Jensen to the numerator and our bound to the denominator.

Page 25:

Experiments

Experiments - (Semi-)Stochastic Bound Majorization

Computing the bound is O(t) → intractable for large t.

Semi-stochastic: compute the bound on data mini-batches.

convergence to a stationary point under weak assumptions (in particular, convexity is not required)
linear convergence rate for the logistic regression problem when the batch size grows sufficiently fast

Theorem

For each iteration k we have (for any ε > 0)

J(θ^k) − J(θ*) ≤ (1 − ρ/L)^k [J(θ0) − J(θ*)] + O(C_k),

with C_k = max{B_k, (1 − ρ/L + ε)^k} and B_k = ‖∇J(θ)|_{θ=θ^k} − g^k_T‖², where T is the mini-batch.
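A sketch of the semi-stochastic variant, reusing bound_majorization_step from the earlier page; the geometric schedule stands in for "batch size grows sufficiently fast" and its constants are illustrative:

```python
import numpy as np

def semi_stochastic_bound_majorization(examples, theta0, lam=1.0,
                                       batch0=32, growth=2.0, n_iter=10):
    # Compute the bound on a growing mini-batch instead of all t examples,
    # so each iteration costs O(batch size) rather than O(t).
    rng = np.random.default_rng(0)
    theta, batch = theta0, float(batch0)
    for _ in range(n_iter):
        size = min(int(batch), len(examples))
        idx = rng.choice(len(examples), size=size, replace=False)
        theta = bound_majorization_step(theta, [examples[i] for i in idx], lam)
        batch *= growth              # grow the batch "sufficiently fast"
    return theta
```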


Page 27:

Experiments

Experiments - (Semi-)Stochastic Bound Majorization

Datasets: rcv1, adult, protein

[Figure: objective minus optimum (training), test logistic loss, and test error (%) versus effective passes for LBFGS, SGD, ASGD, SAG, SAGls, and SQB on the three datasets.]

Page 28:

How to design a good objective function?

Page 29:

Multi-class classification problem

eXtreme multi-class classification problem

Problem setting

classification with a large number of classes

data is accessed online

Goal:

good predictor with logarithmic training and testing time

reduction to tree-structured binary classification

top-down approach for class partitioning allowing gradient descent style optimization

Page 30:

Multi-class classification problem

What was already done...

Intractable

one-against-all [RK04]; variants of ECOC [DB95], e.g. PECOC [LB05]; clustering-based approaches [BWG10, WMY13]

Choice of partition not addressed

Filter Tree and error-correcting tournaments [BLR09]

Choice of partition addressed, but dedicated to conditional probability estimation

conditional probability tree [BLLSS09]

Splitting criteria not well-suited to the large-class setting

decision trees [KM95]

. . .

Page 31:

Splitting criterion

How do you learn the structure?

Not all partitions are equally difficult, e.g.:
if you split {1, 7} vs {3, 8}, the next problem is hard;
if you split {1, 8} vs {3, 7}, the next problem is easy;
if you split {1, 3} vs {7, 8}, the next problem is easy.

[BWG10]: Better to confuse near the leaves than near the root. Intuition: the root predictor tends to be overconstrained while the leafward predictors are less constrained.


Page 33:

Splitting criterion

How do you learn the structure?

Our approach:

a top-down approach for class partitioning

a splitting criterion guaranteeing a balanced tree (⇒ logarithmic training and testing time) and small classification error

Page 34:

Splitting criterion

Pure split and balanced split

k_r(x): number of data points in the same class as x on the right side of the partitioning

k(x): total number of data points in the same class as x

n_r: number of data points on the right side of the partitioning

n: total number of data points

Measure of balancedness: n_r / n
Measure of purity: k_r(x) / k(x)


Page 38:

Splitting criterion

Pure split and balanced split

k: number of classes
H: hypothesis class (typically linear classifiers)
π_y = |X_y| / n
balance = Pr(h(x) > 0)
purity = Σ_{y=1}^k π_y min( Pr(h(x) > 0 | y), Pr(h(x) < 0 | y) )

Definition (Balanced split)

The hypothesis h ∈ H induces a balanced split iff ∃ c ∈ (0, 0.5]: c ≤ balance ≤ 1 − c.

Definition (Pure split)

The hypothesis h ∈ H induces a pure split iff ∃ δ ∈ [0, 0.5): purity ≤ δ.
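Both quantities are straightforward to estimate from data; a short sketch (the function name is mine; h_scores holds real-valued h(x) per example, labels the class ids, both NumPy arrays):

```python
import numpy as np

def balance_and_purity(h_scores, labels):
    # balance = Pr(h(x) > 0); purity = sum_y pi_y min(Pr(h>0|y), Pr(h<0|y)).
    right = h_scores > 0
    balance = right.mean()
    purity = 0.0
    for y in np.unique(labels):
        mask = labels == y
        p_right_given_y = right[mask].mean()
        purity += mask.mean() * min(p_right_given_y, 1.0 - p_right_given_y)
    return balance, purity
```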


Page 41:

Splitting criterion

Objective function

J(h) = 2 Σ_{y=1}^k π_y |P(h(x) > 0) − P(h(x) > 0 | y)|
     = 2 E_{x,y}[ |P(h(x) > 0) − P(h(x) > 0 | y)| ]

J(h) ⇒ splitting criterion (objective function):

Given a set of n examples, each with one of k labels, find a partitioner h that maximizes the objective.

Lemma

For any hypothesis h: X → {−1, 1}, the objective J(h) satisfies J(h) ∈ [0, 1]. Furthermore, h induces a maximally pure and balanced partition iff J(h) = 1.
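The empirical counterpart of J(h), in the same style as the earlier balance/purity sketch:

```python
import numpy as np

def j_objective(h_scores, labels):
    # J(h) = 2 sum_y pi_y |P(h(x) > 0) - P(h(x) > 0 | y)|, estimated from data.
    right = h_scores > 0
    p_right = right.mean()
    J = 0.0
    for y in np.unique(labels):
        mask = labels == y
        J += 2.0 * mask.mean() * abs(p_right - right[mask].mean())
    return J
```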

Page 42:

Splitting criterion

Balancing and purity factors

Balancing factor

balance ∈ [ (1 − √(1 − J(h))) / 2 , (1 + √(1 − J(h))) / 2 ]

[Figure: the lower bound y = (1 − √(1 − J(h)))/2 and upper bound y = (1 + √(1 − J(h)))/2 on the balance, as functions of J(h).]

Page 43:

Splitting criterion

Balancing and purity factors

Purity factor

purity ≤ (2 − J(h)) / (4 · balance) − balance

[Figure: the bound y = (2 − J(h))/(4 · balance) − balance as a function of J(h), at balance = 1/2.]

Page 44:

Boosting statement

What is the quality of the obtained tree?

In each node of the tree T, optimize the splitting criterion

Apply recursively to construct a tree structure

Measure the quality of the tree using the entropy

G_T = Σ_{l ∈ leaves of T} w_l Σ_{y=1}^k π_{l,y} ln(1 / π_{l,y})

Why?

Small entropy of the leaves ⇒ pure leaves

Goal: maximizing the objective reduces the entropy
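A small sketch of G_T from per-leaf statistics; the convention 0 · ln(1/0) = 0 is an assumption I make explicit:

```python
import numpy as np

def tree_entropy(leaf_weights, leaf_class_probs):
    # G_T = sum over leaves l of w_l * sum_y pi_{l,y} ln(1/pi_{l,y});
    # leaf_weights holds w_l, leaf_class_probs the distributions pi_{l,.}.
    G = 0.0
    for w_l, pi in zip(leaf_weights, leaf_class_probs):
        pi = np.asarray(pi, dtype=float)
        nz = pi[pi > 0]                  # convention: 0 * ln(1/0) = 0
        G += w_l * np.sum(nz * np.log(1.0 / nz))
    return G

print(tree_entropy([0.5, 0.5], [[1.0, 0.0], [0.5, 0.5]]))  # 0.5 * ln 2
```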

Page 45:

Boosting statement

What is the quality of the obtained tree?

Definition (Weak Hypothesis Assumption)

Let m denote any node of the tree T, and let β_m = P(h_m(x) > 0) and P_{m,i} = P(h_m(x) > 0 | i). Furthermore, let γ ∈ R+ be such that for all m, γ ∈ (0, min(β_m, 1 − β_m)]. We say that the weak hypothesis assumption is satisfied when for any distribution P over X, at each node m of the tree T there exists a hypothesis h_m ∈ H such that J(h_m)/2 = Σ_{i=1}^k π_{m,i} |P_{m,i} − β_m| ≥ γ.

Theorem

Under the Weak Hypothesis Assumption, for any ε ∈ [0, 1], to obtain G_T ≤ ε it suffices to make (1/ε)^{4(1−γ)² ln k / γ²} splits.

Tree depth ≈ log[ (1/ε)^{4(1−γ)² ln k / γ²} ] = O(ln k) ⇒ logarithmic training and testing time
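The split count from the theorem as a one-liner; the parameter values in the check are illustrative, and the printed log of the count is the quantity that scales as O(ln k):

```python
import numpy as np

def splits_needed(eps, gamma, k):
    # (1/eps)^(4 (1-gamma)^2 ln(k) / gamma^2), the split count in the theorem.
    return (1.0 / eps) ** (4.0 * (1.0 - gamma) ** 2 * np.log(k) / gamma ** 2)

# Tree depth is the log of the split count, which is O(ln k):
print(np.log(splits_needed(eps=0.1, gamma=0.4, k=1000)))
```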


Page 47:

Online partitioning

LOMtree algorithm

Recall the objective function we consider at every tree node:

J(h) = 2 E_y[ |E_x[1(h(x) > 0)] − E_x[1(h(x) > 0) | y]| ]

Problem: discrete optimization. Relaxation: drop the indicator operator and look at margins. The objective function becomes

J(h) = 2 E_y[ |E_x[h(x)] − E_x[h(x) | y]| ]

Keep online empirical estimates of these expectations. The sign of the difference of the two expectations decides whether to send an example to the left or right child node.


Page 50:

Online partitioning

LOMtree algorithm

Let e = 0 and, for all y, e_y = 0, n_y = 0
For each example (x, y):
  if e_y < e then b = −1 else b = 1
  update w using (x, b)
  n_y ← n_y + 1
  e_y ← ((n_y − 1) e_y) / n_y + (w·x) / n_y
  e ← ((n − 1) e) / n + (w·x) / n

Apply recursively to construct a tree structure.
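A self-contained sketch of one node's online logic; the perceptron-style step stands in for the unspecified "update w using (x, b)", and the class and field names are mine:

```python
import numpy as np

class LOMNode:
    # One LOMtree node: an online binary partitioner over classes. It keeps
    # running means e (over all examples) and e_y (per class) of the margin
    # w.x, exactly as in the updates above.
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr
        self.e, self.n = 0.0, 0
        self.e_y, self.n_y = {}, {}

    def update(self, x, y):
        # Desired side b: push class y away from the side of the global mean
        # it currently sits on, increasing |E[h(x)] - E[h(x)|y]|.
        b = -1.0 if self.e_y.get(y, 0.0) < self.e else 1.0
        if b * (self.w @ x) <= 0:        # perceptron-style stand-in for
            self.w += self.lr * b * x    # "update w using (x, b)"
        m = self.w @ x                   # refresh the running means
        self.n_y[y] = self.n_y.get(y, 0) + 1
        old = self.e_y.get(y, 0.0)
        self.e_y[y] = old + (m - old) / self.n_y[y]
        self.n += 1
        self.e += (m - self.e) / self.n
        return 1 if m > 0 else -1        # route by the sign of the predictor
```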


Page 64:

Experiments

Experiments

Table: Training time on selected problems.

           Isolet    Sector    Aloi
LOMtree    16.27s    12.77s    51.86s
OAA        19.58s    18.37s    11m2.43s

Table: Per-example test time on all problems.

           Isolet    Sector    Aloi      ImNet     ODP
LOMtree    0.14ms    0.13ms    0.06ms    0.52ms    0.26ms
OAA        0.16ms    0.24ms    0.33ms    0.21s     1.05s

Table: Test error (%) and confidence interval on all problems.

                LOMtree       Rtree         Filter tree
Isolet (26)     6.36±1.71     16.92±2.63    15.10±2.51
Sector (105)    16.19±2.33    15.77±2.30    17.70±2.41
Aloi (1000)     16.50±0.70    83.74±0.70    80.50±0.75
ImNet (22K)     90.17±0.05    96.99±0.03    92.12±0.04
ODP (105K)      93.46±0.12    93.85±0.12    93.76±0.12

Page 65:

Experiments

Experiments

[Figure: LOMtree vs one-against-all. Left: accuracy versus number of classes (26, 105, 1000, 21841, 105033) for OAA and LOMtree. Right: log2(time ratio) versus log2(number of classes).]

Page 66:

How to understand non-convex optimization?

Page 67:

Deep learning: motivation and challenges

Why non-convex optimization and deep learning?

State-of-the-art results on a number of problems:

image recognition [KSH12, CMGS10]

speech recognition [HDYDMJSVNSK12, GMH13]

natural language processing [WCA14]

video recognition [KTSLSF-F14, SZ14]

Page 68:

Deep learning: motivation and challenges

Why non-convex optimization and deep learning?

ImageNet 2014 Challenge: mostly convolutional networks

Page 69:

Deep learning: motivation and challenges

Challenge

Goal: understanding the loss function in deep learning.

Recent related works: Choromanska et al., 2015; Goodfellow et al., 2015; Dauphin et al., 2014; Saxe et al., 2014.

Questions:

Why do the results of multiple experiments with multilayer networks consistently give very similar performance despite the presence of many local minima?

What is the role of saddle points in the optimization problem?

Is the surface of the loss function of multilayer networks structured?


Page 73:

Deep learning: motivation and challenges

Multilayer network and spin-glass model

Can we use the spin-glass theory to explain the optimization paradigm with large multilayer networks?

What assumptions need to be made?


Page 75:

Non-convex loss function in deep learning

Loss function in deep learning and assumptions

[Figure: feedforward network with inputs X1, X2, ..., Xd, hidden layers of sizes n1, n2, ..., nH, and output Y.]

Y = (1 / Λ^{(H−1)/2}) Σ_{i=1}^Ψ X_i A_i ∏_{k=1}^H w_i^{(k)}

Ψ: number of input-output paths; Λ = Ψ^{1/H} (assume Λ ∈ Z+)

H − 1: number of hidden layers

w_i^{(k)}: the weight of the kth segment of the ith path

A_i: Bernoulli r.v. denoting path activation (0/1)

Page 76:

Non-convex loss function in deep learning

Loss function in deep learning and assumptions

Consider the hinge loss

L(w) = max(0, 1 − Y_t Y),

where Y_t corresponds to the true data labeling (1/−1) and w denotes all network weights.

The max operator is often modeled as a Bernoulli r.v. (0/1). Denote it as M and its expectation as ρ′. Therefore

L(w) = M(1 − Y_t Y) = M + (1 / Λ^{(H−1)/2}) Σ_{i=1}^Ψ Z_i I_i ∏_{k=1}^H w_i^{(k)},   (1)

where Z_i = −Y_t X_i, and I_i = M A_i is a Bernoulli r.v. (0/1).

assume I_1, I_2, ..., I_Ψ are identically distributed (A1p)

assume each X_i is a standard Gaussian r.v. (A2p)

Page 77:

Non-convex loss function in deep learning

Loss function in deep learning and assumptions

assume the network parametrization is redundant (A3p)

assume unique parameters are uniformly distributed on the graph of connections of the network (A4p), i.e. every H-length product of unique weights appears in Equation 1 (the set of all products is {w_{i1} w_{i2} ... w_{iH}}_{i1,i2,...,iH=1}^Λ).

L(w) = M + (1 / Λ^{(H−1)/2}) Σ_{i1,i2,...,iH=1}^Λ Z_{i1,i2,...,iH} I_{i1,i2,...,iH} w_{i1} w_{i2} ... w_{iH}.

Page 78:

Non-convex loss function in deep learning

Loss function in deep learning and assumptions

Definition

A network M which has the same graph of connections as network N, whose size is N, and s unique weights satisfying s ≤ N, is called a (s, ε)-reduction image of N for some ε ∈ [0, 1] if the prediction accuracies of N and M differ by no more than ε (thus they classify at most an ε fraction of data points differently).

Theorem

Let N be a neural network giving an output whose expectation w.r.t. the A's is Y_N. Let M be its (s, ε)-reduction image for some s ≤ N and ε ∈ [0, 0.5]. By analogy, let Y_s be the expected output of network M. Then the following holds:

corr(sign(Y_s), sign(Y_N)) ≥ (1 − 2ε) / (1 + 2ε),

where corr(A, B) = E[(A − E[A])(B − E[B])] / (std(A) std(B)), std is the standard deviation, and sign(·) denotes the sign of the prediction.

Page 79:

Non-convex loss function in deep learning

Loss function in deep learning and assumptions

assume the independence of Z_{i1,i2,...,iH} and I_{i1,i2,...,iH} (A5u)

E_{M,I_1,I_2,...,I_Ψ}[L(w)] = ρ′ + ρ · L_{Λ,H}(w), where L_{Λ,H}(w) = (1 / Λ^{(H−1)/2}) Σ_{i1,i2,...,iH=1}^Λ Z_{i1,i2,...,iH} w_{i1} w_{i2} ... w_{iH}.

assume that the Z's are independent (A6u); impose the spherical constraint (A7p):

(1/Λ) Σ_{i=1}^Λ w_i² = 1.

We obtain the Hamiltonian of the spherical spin-glass model!

Question: What happens when Λ → ∞?
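A small sketch that evaluates L_{Λ,H} for Gaussian couplings under the spherical constraint; the function name and the tensordot contraction are my choices:

```python
import numpy as np

def spin_glass_hamiltonian(w, Z):
    # L_{Lambda,H}(w) = Lambda^{-(H-1)/2} * sum over (i1,...,iH) of
    # Z[i1,...,iH] * w[i1] * ... * w[iH]; Z.ndim gives H.
    Lam, H = w.shape[0], Z.ndim
    val = Z
    for _ in range(H):
        val = np.tensordot(val, w, axes=([0], [0]))  # contract one index
    return val / Lam ** ((H - 1) / 2)

# Example with H = 3, Lambda = 20; w is rescaled onto the sphere
# (1/Lambda) sum_i w_i^2 = 1, and Z holds i.i.d. standard Gaussians.
rng = np.random.default_rng(0)
Lam, H = 20, 3
w = rng.normal(size=Lam)
w *= np.sqrt(Lam) / np.linalg.norm(w)
Z = rng.normal(size=(Lam,) * H)
print(spin_glass_hamiltonian(w, Z))
```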

Page 80:

Spherical spin-glass model

Important quantities

Definition

Let the following quantity be called an energy barrier:

E∞ = E∞(H) = 2 √((H − 1)/H).

Definition

Let the normalized minimum of the Hamiltonian L_{Λ,H} be called a ground state and be defined as

E0 = (1/Λ) inf_{σ ∈ S^{Λ−1}(√Λ)} L_{Λ,H}(σ).

Let (E_k(H))_{k∈N} be a strictly decreasing sequence converging to E∞ as k → ∞.
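For the setting used in the figures that follow (H = 3, Λ = 1000), the band edge −ΛE∞ is easy to evaluate:

```python
import numpy as np

H, Lam = 3, 1000                      # the figures' setting
E_inf = 2.0 * np.sqrt((H - 1) / H)    # energy barrier E_inf(H)
print(-Lam * E_inf)                   # about -1633, the red line in the figures
```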

Page 81:

Spherical spin-glass model

Hamiltonian of the spherical spin-glass model: properties

All critical values of the Hamiltonian of fixed index¹ (non-diverging with Λ) must lie in the band (−ΛE₀(H), −ΛE∞(H)).

Finding a critical value with index larger than or equal to k (for any fixed integer k) below energy level −ΛEₖ(H) is improbable.

With overwhelming probability, the critical values just above the global minimum (ground state) are exclusively local minima. Above the band (−ΛE₀(H), −ΛE₁(H)) containing only local minima (critical points of index 0), there is another one, (−ΛE₁(H), −ΛE₂(H)), where one can only find local minima and saddle points of index 1, and above this band there exists another one, (−ΛE₂(H), −ΛE₃(H)), where one can only find local minima and saddle points of index 1 and 2, and so on.

¹The index of ∇²L at w is the number of negative eigenvalues of the Hessian ∇²L at w. Local minima have index 0.
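A small illustrative sketch of this notion (added here, not from the slides): count the negative Hessian eigenvalues by finite differences for a toy loss; the helper hessian_index and the toy loss are hypothetical.

```python
import numpy as np

# Estimate the index of a point w: the number of negative eigenvalues of the
# Hessian of loss_fn at w, with the Hessian built by central finite differences.
def hessian_index(loss_fn, w, h=1e-4):
    d = w.size
    eye = np.eye(d)
    hess = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = eye[i] * h, eye[j] * h
            hess[i, j] = (loss_fn(w + e_i + e_j) - loss_fn(w + e_i)
                          - loss_fn(w + e_j) + loss_fn(w)) / h ** 2
    eigs = np.linalg.eigvalsh((hess + hess.T) / 2)  # symmetrize, then eigenvalues
    return int((eigs < 0).sum())                    # local minima have index 0

# Example: f(x, y) = x^2 - y^2 has a saddle of index 1 at the origin.
print(hessian_index(lambda v: v[0] ** 2 - v[1] ** 2, np.zeros(2)))  # -> 1
```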


Figure: mean number of critical points (full range) and mean number of low-index critical points for k = 0, 1, …, 5 (with zoomed views), plotted against the energy level Λu. H = 3 and Λ = 1000. Black line: u = −ΛE₀(H), red line: u = −ΛE∞(H).


Deep networks versus spherical spin-glass models

Comparison of deep network and spherical spin-glass model

Figure: histograms of losses (count vs. loss) for the spin-glass model with Λ ∈ {25, 50, 100, 250, 500} and for the network with nhidden ∈ {25, 50, 100, 250, 500}, together with the test loss vs. nhidden and its mean and variance.


Deep network: correlation between train and test loss

Figure: test loss vs. train loss for networks with a) n₁ = 2, b) n₁ = 5, c) n₁ = 10, d) n₁ = 25, e) n₁ = 50, f) n₁ = 100, g) n₁ = 250, h) n₁ = 500 hidden units.

n1    25      50      100     250     500
ρ     0.7616  0.6861  0.5983  0.5302  0.4081

Table: Pearson correlation between training and test loss.
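For reference, a minimal sketch (an added illustration; the arrays are hypothetical stand-ins for final losses collected over many independent training runs) of how such a correlation is computed:

```python
import numpy as np

# Hypothetical data: one (train, test) loss pair per training run.
train_losses = np.array([0.090, 0.100, 0.080, 0.110, 0.095])
test_losses  = np.array([0.100, 0.110, 0.090, 0.130, 0.105])
rho = np.corrcoef(train_losses, test_losses)[0, 1]  # Pearson correlation
print(f"Pearson rho = {rho:.4f}")
```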


Deep network: index of recovered solutions

Figure: distribution (frequency vs. normalized index) of the normalized index of recovered solutions for n₁ = a) 10, b) 25, c) 50, d) 100 hidden units.


Understanding non-convex deep learning optimization

Spherical spin-glass versus deep network

Conjecture (Deep learning)

For large-size networks, most local minima are equivalent and yield similar performance on a test set.

Spherical spin-glass: Critical points form an ordered structure such that there exists an energy barrier −ΛE∞ (a certain value of the Hamiltonian) below which, with overwhelming probability, one can find only low-index critical points, most of which are concentrated close to the barrier.


Conjecture (Deep learning)

The probability of finding a “bad” (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.

Spherical spin-glass: Low-index critical points lie ‘geometrically’ closer to the ground state than high-index critical points.


Conjecture (Deep learning)

Saddle points play a key role in the optimization problem in deep learning.

Spherical spin-glass: With overwhelming probability one can find only high-index saddle points above energy −ΛE∞, and there are exponentially many of those.

Figure: mean number of critical points and mean number of low-index critical points (k = 0, 1, …, 5) vs. Λu. H = 3 and Λ = 1000. Black line: u = −ΛE₀(H) (ground state), red line: u = −ΛE∞(H) (energy barrier).


Conjecture (Deep learning)

Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.

n1    25      50      100     250     500
ρ     0.7616  0.6861  0.5983  0.5302  0.4081

Table: Pearson correlation between training and test loss for different numbers of hidden units of a network with one hidden layer. MNIST dataset.

Spherical spin-glass: Recovering the ground state, i.e. the global minimum, takes an exponentially long time.


Take-home message

For large-size networks, most local minima are equivalent and yield similar performance on a test set.

The probability of finding a “bad” (high value) local minimum is non-zero for small-size networks and decreases quickly with network size.

Struggling to find the global minimum on the training set (as opposed to one of the many good local ones) is not useful in practice and may lead to overfitting.


Open problem

Can we establish a stronger connection between the loss function of the deep model and the spherical spin-glass model by dropping the unrealistic assumptions?


Conclusions

Convexity and non-convexity: challenges

Convex and non-convex world

Building solvers, i.e. bound majorization:
New and tight quadratic bound on the partition function
Linear convergence of the batch/semi-stochastic variants
Competitive with or better than state-of-the-art methods
Admits multiple extensions

Designing problem-specific objectives, i.e. multi-classification:
Logarithmic training and testing time
Reduction from multi-class to binary classification
New splitting criterion with desirable properties:
allows gradient-descent-style optimization
makes decision trees applicable to multi-class classification

Non-convex world

Understanding why non-convex approaches work:
Deep learning: state-of-the-art in numerous problems
Possible connection between spin-glass theory and deep learning
Landscape is highly non-convex but most likely structured


Acknowledgments

Courant Institute of Mathematical Sciences: Yann LeCun, Gerard Ben Arous
Columbia University: Tony Jebara, Shih-Fu Chang
George Washington University: Claire Monteleoni
NYU Polytechnic School of Engineering: Mariusz Bojarski
Microsoft Research: John Langford, Alekh Agarwal
Google Research: Krzysztof Choromanski, Dimitri Kanevsky
IBM T. J. Watson Research Center: Aleksandr Aravkin
AT&T Shannon Research Laboratories: Phyllis Weiss, Alice Chen
LEAR team of Inria: Zaid Harchaoui

PostDocs: Pablo Sprechmann
PhD students: Michael Mathieu, Mikael Bruce Henaff, Sixin Zhang, Ross Goroshin, Rahul Krishnan, Wojciech Zaremba, Hyungtae Kim