
  • Principled Deep Neural Network Training through Linear Programming

    Daniel Bienstock¹, Gonzalo Muñoz², Sebastian Pokutta³

    January 9, 2019

    ¹ IEOR, Columbia University
    ² IVADO, Polytechnique Montréal
    ³ ISyE, Georgia Tech

  • “...I’m starting to look at machine learning problems”

    Oktay Günlük’s research interests, Aussois 2019


  • Goal of this talk

    • Deep Learning is receiving significant attention due to its impressive performance.

    • Unfortunately, results on the complexity of training deep neural networks have only recently been obtained.

    • Our goal: to show that large classes of Neural Networks can be trained to near optimality using linear programs whose size is linear in the data.


  • Empirical Risk Minimization problem

    Given:

    • D data points (x̂_i, ŷ_i), i = 1, …, D
    • x̂_i ∈ R^n, ŷ_i ∈ R^m
    • A loss function ℓ : R^m × R^m → R (not necessarily convex)

    Compute f : R^n → R^m to solve

      min_f   (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))
      s.t.    f ∈ F (some class)


  • Empirical Risk Minimization problem

    min_f   (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))
    s.t.    f ∈ F (some class)

    Examples:

    • Linear Regression: f(x) = Ax + b with ℓ₂ loss.
    • Binary Classification: varying architectures for f and cross-entropy loss
      ℓ(p, y) = −y log(p) − (1 − y) log(1 − p).
    • Neural Networks with k layers: f(x) = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ ⋯ ∘ σ ∘ T_1(x), each T_j affine.
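    To make the objective concrete, here is a minimal NumPy sketch of the empirical risk for the linear-regression example above; the names, sizes, and random data are purely illustrative.

```python
import numpy as np

def empirical_risk(f, loss, X, Y):
    """Average loss of the predictor f over the data set {(x_i, y_i)}."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Linear-regression instance from the slide: f(x) = Ax + b with squared loss.
rng = np.random.default_rng(0)
n, m, D = 3, 2, 100
X = rng.uniform(-1, 1, size=(D, n))   # features x_i in [-1, 1]^n
Y = rng.uniform(-1, 1, size=(D, m))   # labels   y_i in [-1, 1]^m
A = rng.uniform(-1, 1, size=(m, n))
b = rng.uniform(-1, 1, size=m)

risk = empirical_risk(lambda x: A @ x + b,
                      lambda p, y: np.sum((p - y) ** 2),
                      X, Y)
print(risk)
```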

  • Function parameterization

    We assume the family F (the statistician’s hypothesis class) is parameterized: there exists f such that

      F = {f(·, θ) : θ ∈ Θ ⊆ [−1, 1]^N}.

    Thus, THE problem becomes

      min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)


  • What we know for Neural Nets

  • Neural Networks

    • D data points (x̂_i, ŷ_i), 1 ≤ i ≤ D, x̂_i ∈ R^n, ŷ_i ∈ R^m
    • f = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ ⋯ ∘ σ ∘ T_1
    • Each T_i affine: T_i(y) = A_i y + b_i
    • A_1 is n × w, A_{k+1} is w × m, and A_i is w × w otherwise.

    [Diagram: fully connected network with input width n, k hidden layers of width w, and output width m.]
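    The layered composition above translates directly into a forward pass. A minimal NumPy sketch follows; the dimensions and the choice of ReLU for σ are illustrative.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def dnn_forward(x, weights, biases):
    """f = T_{k+1} o sigma o T_k o ... o sigma o T_1, with T_i(y) = A_i y + b_i.
    sigma is applied after every affine map except the last one."""
    y = x
    for A, b in zip(weights[:-1], biases[:-1]):
        y = relu(A @ y + b)
    return weights[-1] @ y + biases[-1]

# Layer widths as on the slide: input n, k hidden layers of width w, output m.
rng = np.random.default_rng(0)
n, w, m, k = 4, 5, 2, 3
dims = [n] + [w] * k + [m]
weights = [rng.uniform(-1, 1, size=(dims[i + 1], dims[i])) for i in range(k + 1)]
biases = [rng.uniform(-1, 1, size=dims[i + 1]) for i in range(k + 1)]

print(dnn_forward(rng.uniform(-1, 1, size=n), weights, biases))
```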


  • Hardness Results

    Theorem (Blum and Rivest 1992)
    Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ ∈ {absolute value, 2-norm squared}, and σ a threshold function. Then training is NP-hard even in this simple network:

    [Diagram: a single-hidden-layer network with two hidden units and one output unit.]

    Theorem (Boob, Dey and Lan 2018)
    Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ a norm, and σ(t) = max{0, t} a ReLU activation. Then training is NP-hard in the same network.

  • Exact Training Complexity

    Theorem (Arora, Basu, Mianjy and Mukherjee 2018)
    If k = 1 (one “hidden layer”), m = 1 and ℓ is convex, there is an exact training algorithm of complexity

      O(2^w D^{nw} poly(D, n, w))

    Polynomial in the size of the data set, for fixed n and w.

    Also in that paper:
    “we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size”

    “Perhaps an even better breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack”


  • What we’ll prove

    There exists a polytope:

    • whose size depends linearly on D
    • that encodes approximately all possible training problems coming from (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}.

    Spoiler: theory-only results.


  • Our Hammer

  • Treewidth

    Treewidth is a parameter that measures how tree-like a graph is.

    Definition
    Given a chordal graph G, we say its treewidth is ω if its clique number is ω + 1. (For a general graph, the treewidth is the minimum treewidth over all chordal completions of G.)

    • Trees have treewidth 1
    • Cycles have treewidth 2
    • K_n has treewidth n − 1
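    These three examples can be checked numerically with networkx’s treewidth heuristic, which returns an upper bound on the treewidth; for these particular graph families the bound happens to coincide with the exact values quoted above. Graph sizes below are arbitrary.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

examples = {
    "tree (balanced binary, depth 3)": nx.balanced_tree(2, 3),
    "cycle on 10 vertices": nx.cycle_graph(10),
    "K_6": nx.complete_graph(6),
}

for name, G in examples.items():
    width, _decomposition = treewidth_min_degree(G)   # upper bound on the treewidth
    print(f"{name}: {width}")
# expected output: 1, 2, 5
```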


  • Approximate optimization of well-behaved functions

    Prototype problem:

      min   c^T x
      s.t.  f_i(x) ≤ 0,  i = 1, …, m
            x ∈ [0, 1]^n

    Toolset:

    • Each f_i is “well-behaved”: Lipschitz constant L_i over [0, 1]^n
    • Intersection graph: an edge whenever two variables appear in the same f_i

    For example:

      x_1 + x_2 + x_3 ≤ 1
      x_3 + x_4 ≥ 1
      x_4 · x_5 + x_6 ≤ 2

    The intersection graph on vertices 1, …, 6 is a triangle on {1, 2, 3} and a triangle on {4, 5, 6}, joined by the edge {3, 4}.
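    A small sketch of the intersection-graph construction for the example above, together with a treewidth upper bound from networkx; constraints are represented only by their variable index sets, and the heuristic reports 2 here, consistent with a chordal graph whose largest cliques are triangles.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree
from itertools import combinations

def intersection_graph(constraints):
    """One vertex per variable, one edge per pair of variables sharing a constraint."""
    G = nx.Graph()
    for variables in constraints:
        G.add_nodes_from(variables)
        G.add_edges_from(combinations(sorted(variables), 2))
    return G

constraints = [{1, 2, 3},   # x_1 + x_2 + x_3 <= 1
               {3, 4},      # x_3 + x_4 >= 1
               {4, 5, 6}]   # x_4 * x_5 + x_6 <= 2
G = intersection_graph(constraints)

print(sorted(G.edges()))              # two triangles joined by the edge (3, 4)
print(treewidth_min_degree(G)[0])     # 2
```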


  • Approximate optimization of well-behaved functions

    Prototype problem:

      min   c^T x
      s.t.  f_i(x) ≤ 0,  i = 1, …, m
            x ∈ [0, 1]^n

    An extension of a result by Bienstock and Muñoz 2018:

    Theorem
    Suppose the intersection graph has treewidth ω and let L = max_i L_i. Then, for every ϵ > 0 there is an LP relaxation of size

      O((L/ϵ)^{ω+1} · n)

    that guarantees ϵ optimality and feasibility errors.


  • Application to ERM problem

    We now apply the LP approximation result to:

      min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)

    with Θ ⊆ [−1, 1]^N, x̂_i ∈ [−1, 1]^n and ŷ_i ∈ [−1, 1]^m. We use the epigraph formulation:

      min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} L_i
      s.t.  L_i ≥ ℓ(f(x̂_i, θ), ŷ_i),  1 ≤ i ≤ D

    Let L be the Lipschitz constant of g(x, y, θ) := ℓ(f(x, θ), y) over [−1, 1]^{n+m+N}.


  • Application to ERM problem

    Theorem
    For every ϵ > 0, ℓ, Θ ⊆ [−1, 1]^N and D, there is a polytope of size

      O((2L/ϵ)^{N+n+m} · D)

    such that for every data set (X̂, Ŷ) = (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}, there is a face F_{X̂,Ŷ} such that optimizing (1/D) ∑_{i=1}^{D} L_i over F_{X̂,Ŷ} provides an ϵ-approximation to ERM with data X̂, Ŷ.


  • Proof Sketch

    Every system of constraints of the type

      L_i ≥ ℓ(f(x_i, θ), y_i),  1 ≤ i ≤ D

    has an intersection graph with the following structure:

    [Diagram: the parameter variables θ_1, …, θ_N form a central hub; each data point i contributes a block {L_i, x_i, y_i} attached only to that hub, for i = 1, …, D.]

    and has treewidth at most N + n + m.
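    This “hub with attached blocks” structure can be reproduced directly from the variable sets of the epigraph constraints. The sketch below (with arbitrary small N, n, m and D) shows the resulting treewidth bound staying at N + n + m no matter how large D is, which is exactly the point exploited on the next slide.

```python
import networkx as nx
from itertools import combinations
from networkx.algorithms.approximation import treewidth_min_degree

N, n, m, D = 4, 3, 2, 50                      # illustrative sizes
theta = [f"theta_{j}" for j in range(N)]      # parameters, shared by all constraints

G = nx.Graph()
for i in range(D):
    # Variables of the i-th constraint  L_i >= loss(f(x_i, theta), y_i):
    block = theta + [f"x_{i}_{j}" for j in range(n)] \
                  + [f"y_{i}_{j}" for j in range(m)] + [f"L_{i}"]
    G.add_edges_from(combinations(block, 2))  # each constraint yields a clique

width, _ = treewidth_min_degree(G)
print(width)   # 9 = N + n + m, independent of D
```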


  • LP size details

    Thus the LP size given by the treewidth,

      O((L/ϵ)^{ω+1} · n),

    becomes

      O((2L/ϵ)^{N+n+m} · D).

    The key lies in the fact that D does not add to the treewidth: the number of variables grows linearly in D, while the treewidth stays at most N + n + m.

    Different architectures yield different N and L.


  • Architecture-Specific Consequences

  • Fully connected DNN, ReLU activations, quadratic loss

    For any k, n, m, w, ϵ there is a uniform LP of size

      O((2^{k+1} m n w k^2/ϵ)^{N+n+m} · D)

    with the same guarantees: ϵ-approximation and data-dependent faces.

    Core of the proof: in a DNN with k hidden layers and quadratic loss, the Lipschitz constant of g(x, y, θ) over [−1, 1]^{n+m+N} is O(m n w k^2).
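    A rough way to get a feel for this constant is to sample finite differences of g over [−1, 1]^{n+m+N}. The sketch below gives only a crude empirical lower bound, not the analytic bound quoted above; all dimensions and the sampling scheme are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w, m, k = 3, 4, 2, 2
dims = [n] + [w] * k + [m]
N = sum(dims[i + 1] * dims[i] + dims[i + 1] for i in range(k + 1))  # number of parameters

def g(x, y, theta):
    """g(x, y, theta) = ||f(x, theta) - y||^2 for a fully connected ReLU network."""
    z, pos = x, 0
    for i in range(k + 1):
        rows, cols = dims[i + 1], dims[i]
        A = theta[pos:pos + rows * cols].reshape(rows, cols); pos += rows * cols
        b = theta[pos:pos + rows]; pos += rows
        z = A @ z + b
        if i < k:                       # ReLU on every layer except the output
            z = np.maximum(0.0, z)
    return np.sum((z - y) ** 2)

# Crude sampled lower bound on the Lipschitz constant of g over [-1, 1]^{n+m+N}.
best = 0.0
for _ in range(2000):
    p = rng.uniform(-1, 1, size=n + m + N)
    q = np.clip(p + rng.normal(scale=1e-3, size=p.size), -1, 1)
    gp = g(p[:n], p[n:n + m], p[n + m:])
    gq = g(q[:n], q[n:n + m], q[n + m:])
    best = max(best, abs(gp - gq) / np.linalg.norm(p - q))
print(best)
```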


  • Comparison with Arora et al.

    In the Arora, Basu, Mianjy and Mukherjee setting: k = 1, m = 1 and N ≈ nw.

    Arora et al. running time:   O(2^w D^{nw} poly(D, n, w))
    Uniform LP size:             O((4nw/ϵ)^{(n+1)(w+1)} · D)

    Other differences: exactness, boundedness, convexity vs. Lipschitzness, uniformity.

  • Last comments

    • The results can be improved by considering the sparsity of the network itself.

    • One can obtain previously unknown complexity results (ResNets, convolutional NNs, etc.).

    • Training using this approach generalizes: using enough i.i.d. data points (how many depends on L and ϵ), we get an approximation to the “true” Risk Minimization problem. Our results improve on the best known approximations to this problem as well.

  • Still Open and Future Work

    • It is unknown whether the dependency on w or k can be improved.
    • A better LP size can be obtained by assuming more about the input data or the nature of the problem.
    • We would like to combine these ideas with empirically efficient methods.

  • Thank you!


  • One other improvement

    If we denote by G the graph of the underlying neural network, we can improve the exponent in

      O((nw/ϵ)^{poly(n,k,w,m)} · D)

    using the treewidth tw(G) of G and its maximum degree Δ(G).

    More specifically, one can obtain a uniform LP of size

      O((nw/ϵ)^{O(k · tw(G) · Δ(G))} · (|E(G)| + D))

