
  • Principled Deep Neural Network Training through Linear Programming

    Daniel Bienstock¹, Gonzalo Muñoz², Sebastian Pokutta³

    January 9, 2019

    ¹ IEOR, Columbia University
    ² IVADO, Polytechnique Montréal
    ³ ISyE, Georgia Tech

  • “...I’m starting to look at machine learning problems”

    Oktay Günlük’s research interests, Aussois 2019


  • Goal of this talk

    • Deep Learning is receiving significant attention due to its impressive performance.

    • Unfortunately, results on the complexity of training deep neural networks have only recently been obtained.

    • Our goal: to show that large classes of Neural Networks can be trained to near optimality using linear programs whose size is linear in the data.


  • Empirical Risk Minimization problem

    Given:

    • D data points (x̂_i, ŷ_i), i = 1, …, D
    • x̂_i ∈ R^n, ŷ_i ∈ R^m
    • A loss function ℓ : R^m × R^m → R (not necessarily convex)

    Compute f : R^n → R^m to solve

      min_f   (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))
      s.t.    f ∈ F (some class)


  • Empirical Risk Minimization problem

    min_f   (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i), ŷ_i)   (+ optional regularizer Φ(f))
    s.t.    f ∈ F (some class)

    Examples:

    • Linear Regression: f(x) = Ax + b with ℓ₂ loss.
    • Binary Classification: varying architectures for f and cross-entropy loss
      ℓ(p, y) = −y log(p) − (1 − y) log(1 − p).
    • Neural Networks with k layers: f(x) = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ ⋯ ∘ σ ∘ T_1(x), each T_j affine.
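    To make the objective concrete, here is a minimal NumPy sketch of the empirical risk for the linear-regression example above; the names, sizes, and random data are purely illustrative.

```python
import numpy as np

def empirical_risk(f, loss, X, Y):
    """Average loss of the predictor f over the data set {(x_i, y_i)}."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Linear-regression instance from the slide: f(x) = Ax + b with squared loss.
rng = np.random.default_rng(0)
n, m, D = 3, 2, 100
X = rng.uniform(-1, 1, size=(D, n))   # features x_i in [-1, 1]^n
Y = rng.uniform(-1, 1, size=(D, m))   # labels   y_i in [-1, 1]^m
A = rng.uniform(-1, 1, size=(m, n))
b = rng.uniform(-1, 1, size=m)

risk = empirical_risk(lambda x: A @ x + b,
                      lambda p, y: np.sum((p - y) ** 2),
                      X, Y)
print(risk)
```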

  • Function parameterization

    We assume the family F (the statistician’s hypothesis class) is parameterized: there exists f such that

      F = {f(·, θ) : θ ∈ Θ ⊆ [−1, 1]^N}.

    Thus, THE problem becomes

      min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)


  • What we know for Neural Nets

  • Neural Networks

    • D data points (x̂_i, ŷ_i), 1 ≤ i ≤ D, x̂_i ∈ R^n, ŷ_i ∈ R^m
    • f = T_{k+1} ∘ σ ∘ T_k ∘ σ ∘ ⋯ ∘ σ ∘ T_1
    • Each T_i affine: T_i(y) = A_i y + b_i
    • A_1 is n × w, A_{k+1} is w × m, and A_i is w × w otherwise.

    [Diagram: fully connected network with input width n, k hidden layers of width w, and output width m.]
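    The layered composition above translates directly into a forward pass. A minimal NumPy sketch follows; the dimensions and the choice of ReLU for σ are illustrative.

```python
import numpy as np

def relu(t):
    return np.maximum(0.0, t)

def dnn_forward(x, weights, biases):
    """f = T_{k+1} o sigma o T_k o ... o sigma o T_1, with T_i(y) = A_i y + b_i.
    sigma is applied after every affine map except the last one."""
    y = x
    for A, b in zip(weights[:-1], biases[:-1]):
        y = relu(A @ y + b)
    return weights[-1] @ y + biases[-1]

# Layer widths as on the slide: input n, k hidden layers of width w, output m.
rng = np.random.default_rng(0)
n, w, m, k = 4, 5, 2, 3
dims = [n] + [w] * k + [m]
weights = [rng.uniform(-1, 1, size=(dims[i + 1], dims[i])) for i in range(k + 1)]
biases = [rng.uniform(-1, 1, size=dims[i + 1]) for i in range(k + 1)]

print(dnn_forward(rng.uniform(-1, 1, size=n), weights, biases))
```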


  • Hardness Results

    Theorem (Blum and Rivest 1992)
    Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ ∈ {absolute value, 2-norm squared}, and σ a threshold function. Then training is NP-hard even in this simple network:

    [Diagram: a single-hidden-layer network with two hidden units and one output unit.]

    Theorem (Boob, Dey and Lan 2018)
    Let x̂_i ∈ R^n, ŷ_i ∈ {0, 1}, ℓ a norm, and σ(t) = max{0, t} a ReLU activation. Then training is NP-hard in the same network.

  • Exact Training Complexity

    Theorem (Arora, Basu, Mianjy and Mukherjee 2018)
    If k = 1 (one “hidden layer”), m = 1 and ℓ is convex, there is an exact training algorithm of complexity

      O(2^w D^{nw} poly(D, n, w))

    Polynomial in the size of the data set, for fixed n and w.

    Also in that paper:
    “we are not aware of any complexity results which would rule out the possibility of an algorithm which trains to global optimality in time that is polynomial in the data size”

    “Perhaps an even better breakthrough would be to get optimal training algorithms for DNNs with two or more hidden layers and this seems like a substantially harder nut to crack”


  • What we’ll prove

    There exists a polytope:

    • whose size depends linearly on D
    • that encodes approximately all possible training problems coming from (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}.

    Spoiler: theory-only results.


  • Our Hammer

  • Treewidth

    Treewidth is a parameter that measures how tree-like a graph is.

    Definition
    Given a chordal graph G, we say its treewidth is ω if its clique number is ω + 1. (For a general graph, the treewidth is the minimum treewidth over all chordal completions of G.)

    • Trees have treewidth 1
    • Cycles have treewidth 2
    • K_n has treewidth n − 1
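    These three examples can be checked numerically with networkx’s treewidth heuristic, which returns an upper bound on the treewidth; for these particular graph families the bound happens to coincide with the exact values quoted above. Graph sizes below are arbitrary.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

examples = {
    "tree (balanced binary, depth 3)": nx.balanced_tree(2, 3),
    "cycle on 10 vertices": nx.cycle_graph(10),
    "K_6": nx.complete_graph(6),
}

for name, G in examples.items():
    width, _decomposition = treewidth_min_degree(G)   # upper bound on the treewidth
    print(f"{name}: {width}")
# expected output: 1, 2, 5
```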


  • Approximate optimization of well-behaved functions

    Prototype problem:

      min   c^T x
      s.t.  f_i(x) ≤ 0,  i = 1, …, m
            x ∈ [0, 1]^n

    Toolset:

    • Each f_i is “well-behaved”: Lipschitz constant L_i over [0, 1]^n
    • Intersection graph: an edge whenever two variables appear in the same f_i

    For example:

      x_1 + x_2 + x_3 ≤ 1
      x_3 + x_4 ≥ 1
      x_4 · x_5 + x_6 ≤ 2

    The intersection graph on vertices 1, …, 6 is a triangle on {1, 2, 3} and a triangle on {4, 5, 6}, joined by the edge {3, 4}.
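    A small sketch of the intersection-graph construction for the example above, together with a treewidth upper bound from networkx; constraints are represented only by their variable index sets, and the heuristic reports 2 here, consistent with a chordal graph whose largest cliques are triangles.

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree
from itertools import combinations

def intersection_graph(constraints):
    """One vertex per variable, one edge per pair of variables sharing a constraint."""
    G = nx.Graph()
    for variables in constraints:
        G.add_nodes_from(variables)
        G.add_edges_from(combinations(sorted(variables), 2))
    return G

constraints = [{1, 2, 3},   # x_1 + x_2 + x_3 <= 1
               {3, 4},      # x_3 + x_4 >= 1
               {4, 5, 6}]   # x_4 * x_5 + x_6 <= 2
G = intersection_graph(constraints)

print(sorted(G.edges()))              # two triangles joined by the edge (3, 4)
print(treewidth_min_degree(G)[0])     # 2
```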


  • Approximate optimization of well-behaved functions

    Prototype problem:

      min   c^T x
      s.t.  f_i(x) ≤ 0,  i = 1, …, m
            x ∈ [0, 1]^n

    An extension of a result by Bienstock and Muñoz 2018:

    Theorem
    Suppose the intersection graph has treewidth ω and let L = max_i L_i. Then, for every ϵ > 0 there is an LP relaxation of size

      O((L/ϵ)^{ω+1} · n)

    that guarantees ϵ optimality and feasibility errors.


  • Application to ERM problem

    We now apply the LP approximation result to:

      min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} ℓ(f(x̂_i, θ), ŷ_i)

    with Θ ⊆ [−1, 1]^N, x̂_i ∈ [−1, 1]^n and ŷ_i ∈ [−1, 1]^m. We use the epigraph formulation:

      min_{θ ∈ Θ}  (1/D) ∑_{i=1}^{D} L_i
      s.t.  L_i ≥ ℓ(f(x̂_i, θ), ŷ_i),  1 ≤ i ≤ D

    Let L be the Lipschitz constant of g(x, y, θ) := ℓ(f(x, θ), y) over [−1, 1]^{n+m+N}.


  • Application to ERM problem

    Theorem
    For every ϵ > 0, ℓ, Θ ⊆ [−1, 1]^N and D, there is a polytope of size

      O((2L/ϵ)^{N+n+m} · D)

    such that for every data set (X̂, Ŷ) = (x̂_i, ŷ_i)_{i=1}^{D} ⊆ [−1, 1]^{(n+m)D}, there is a face F_{X̂,Ŷ} such that optimizing (1/D) ∑_{i=1}^{D} L_i over F_{X̂,Ŷ} provides an ϵ-approximation to ERM with data X̂, Ŷ.


  • Proof Sketch

    Every system of constraints of the type

      L_i ≥ ℓ(f(x_i, θ), y_i),  1 ≤ i ≤ D

    has an intersection graph with the following structure:

    [Diagram: the parameter variables θ_1, …, θ_N form a central hub; each data point i contributes a block {L_i, x_i, y_i} attached only to that hub, for i = 1, …, D.]

    and has treewidth at most N + n + m.
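    This “hub with attached blocks” structure can be reproduced directly from the variable sets of the epigraph constraints. The sketch below (with arbitrary small N, n, m and D) shows the resulting treewidth bound staying at N + n + m no matter how large D is, which is exactly the point exploited on the next slide.

```python
import networkx as nx
from itertools import combinations
from networkx.algorithms.approximation import treewidth_min_degree

N, n, m, D = 4, 3, 2, 50                      # illustrative sizes
theta = [f"theta_{j}" for j in range(N)]      # parameters, shared by all constraints

G = nx.Graph()
for i in range(D):
    # Variables of the i-th constraint  L_i >= loss(f(x_i, theta), y_i):
    block = theta + [f"x_{i}_{j}" for j in range(n)] \
                  + [f"y_{i}_{j}" for j in range(m)] + [f"L_{i}"]
    G.add_edges_from(combinations(block, 2))  # each constraint yields a clique

width, _ = treewidth_min_degree(G)
print(width)   # 9 = N + n + m, independent of D
```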


  • LP size details

    Thus the LP size given by the treewidth,

      O((L/ϵ)^{ω+1} · n),

    becomes

      O((2L/ϵ)^{N+n+m} · D).

    The key lies in the fact that D does not add to the treewidth: the number of variables grows linearly in D, while the treewidth stays at most N + n + m.

    Different architectures yield different N and L.


  • Architecture-Specific Consequences

  • Fully connected DNN, ReLU activations, quadratic loss

    For any k, n, m, w, ϵ there is a uniform LP of size

      O((2^{k+1} m n w k^2/ϵ)^{N+n+m} · D)

    with the same guarantees: ϵ-approximation and data-dependent faces.

    Core of the proof: in a DNN with k hidden layers and quadratic loss, the Lipschitz constant of g(x, y, θ) over [−1, 1]^{n+m+N} is O(m n w k^2).
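    A rough way to get a feel for this constant is to sample finite differences of g over [−1, 1]^{n+m+N}. The sketch below gives only a crude empirical lower bound, not the analytic bound quoted above; all dimensions and the sampling scheme are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w, m, k = 3, 4, 2, 2
dims = [n] + [w] * k + [m]
N = sum(dims[i + 1] * dims[i] + dims[i + 1] for i in range(k + 1))  # number of parameters

def g(x, y, theta):
    """g(x, y, theta) = ||f(x, theta) - y||^2 for a fully connected ReLU network."""
    z, pos = x, 0
    for i in range(k + 1):
        rows, cols = dims[i + 1], dims[i]
        A = theta[pos:pos + rows * cols].reshape(rows, cols); pos += rows * cols
        b = theta[pos:pos + rows]; pos += rows
        z = A @ z + b
        if i < k:                       # ReLU on every layer except the output
            z = np.maximum(0.0, z)
    return np.sum((z - y) ** 2)

# Crude sampled lower bound on the Lipschitz constant of g over [-1, 1]^{n+m+N}.
best = 0.0
for _ in range(2000):
    p = rng.uniform(-1, 1, size=n + m + N)
    q = np.clip(p + rng.normal(scale=1e-3, size=p.size), -1, 1)
    gp = g(p[:n], p[n:n + m], p[n + m:])
    gq = g(q[:n], q[n:n + m], q[n + m:])
    best = max(best, abs(gp - gq) / np.linalg.norm(p - q))
print(best)
```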


  • Comparison with Arora et al.

    In the Arora, Basu, Mianjy and Mukherjee setting: k = 1, m = 1 and N ≈ nw.

    Arora et al. running time:   O(2^w D^{nw} poly(D, n, w))
    Uniform LP size:             O((4nw/ϵ)^{(n+1)(w+1)} · D)

    Other differences: exactness, boundedness, convexity vs. Lipschitzness, uniformity.

  • Last comments

    • The results can be improved by considering the sparsity of the network itself.

    • One can obtain previously unknown complexity results (ResNets, convolutional NNs, etc.).

    • Training using this approach generalizes: using enough i.i.d. data points (how many depends on L and ϵ), we get an approximation to the “true” Risk Minimization problem. Our results improve on the best known approximations to this problem as well.

  • Still Open and Future Work

    • It is unknown whether the dependency on w or k can be improved.
    • A better LP size can be obtained by assuming more about the input data or the nature of the problem.
    • We would like to combine these ideas with empirically efficient methods.

  • Thank you!


  • One other improvement

    If we denote by G the graph of the underlying neural network, we can improve the exponent in

      O((nw/ϵ)^{poly(n,k,w,m)} · D)

    using the treewidth tw(G) of G and its maximum degree Δ(G).

    More specifically, one can obtain a uniform LP of size

      O((nw/ϵ)^{O(k · tw(G) · Δ(G))} · (|E(G)| + D))

