Neural Networks with Cheap Differential Operators
Ricky T. Q. Chen and David Duvenaud

University of Toronto, Vector Institute

Overview

Given a function $f : \mathbb{R}^d \to \mathbb{R}^d$, we seek to obtain a vector containing its dimension-wise $k$-th order derivatives,

$D^k_{\text{dim}} f(x) := \left[ \frac{\partial^k f_1(x)}{\partial x_1^k} \;\cdots\; \frac{\partial^k f_d(x)}{\partial x_d^k} \right]^T \in \mathbb{R}^d$    (1)

using only $k$ evaluations of automatic differentiation, regardless of the dimension $d$.

[Figure: the full Jacobian of $f(x)$ vs. $D_{\text{dim}} f(x)$, the diagonal of the Jacobian.]

This has applications in:

I. Solving differential equations.

II. Continuous Normalizing Flows.

III. Learning stochastic differential equations.

Reverse-mode Automatic Differentiation

Deep learning software relies on automatic differentiation (AD) to compute gradients. More generally, reverse-mode AD computes vector-Jacobian products:

$v^T \frac{\partial f(x)}{\partial x} = \sum_{i=1}^d v_i \frac{\partial f_i(x)}{\partial x}$    (2)

However, computing the Jacobian trace (i.e., the sum of $D_{\text{dim}} f$) is as expensive as the full Jacobian: one evaluation of AD can only compute one row of the Jacobian.
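To make the cost concrete, here is a minimal sketch (PyTorch is assumed, since the poster names no framework, and the toy map f is purely illustrative) of what one reverse-mode call buys, a single vector-Jacobian product, versus the d calls a naive Jacobian diagonal requires.

```python
import torch

d = 4
W = torch.randn(d, d)
f = lambda x: torch.tanh(x @ W)   # arbitrary map R^d -> R^d (a stand-in)

x = torch.randn(d, requires_grad=True)
y = f(x)

# One reverse-mode call computes one vector-Jacobian product v^T (df/dx).
v = torch.randn(d)
vjp = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)[0]

# Recovering just the diagonal (or the trace) naively still takes d reverse
# passes, one basis vector per row of the Jacobian.
rows = [torch.autograd.grad(y, x, grad_outputs=torch.eye(d)[i], retain_graph=True)[0]
        for i in range(d)]
naive_diag = torch.stack(rows).diagonal()
```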

Forward: Network Structure

Our approach: dimension-wise derivatives can be efficiently computed for restricted architectures with simple modifications to the AD procedure in the backward pass.

Conditioner: $h_i = c_i(x_{-i})$. The $i$-th hidden dimension depends on all inputs except the $i$-th input dimension. Can be computed in parallel using masked neural networks (e.g. MADE, PixelCNN).

Transformer: $f_i(x) = \tau_i(x_i, h_i)$, where $\tau_i : \mathbb{R}^d \to \mathbb{R}$ outputs the $i$-th dimension given the concatenated vector. Can be computed in parallel if composed of matrix multiplications and element-wise operations.
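Below is a minimal sketch of this conditioner/transformer structure, assuming PyTorch. The masked linear layer is a simple stand-in for a MADE/PixelCNN-style conditioner, and the class and method names (HollowBlock, conditioner, tau) are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HollowBlock(nn.Module):
    """f_i(x) = tau_i(x_i, h_i) with h_i = c_i(x_{-i})  (toy sketch)."""

    def __init__(self, d, hidden=64):
        super().__init__()
        self.d = d
        # Conditioner c_i(x_{-i}): a linear map masked so that hidden
        # block i never sees input dimension i.
        self.cond = nn.Linear(d, d * hidden)
        mask = (1.0 - torch.eye(d)).repeat_interleave(hidden, dim=0)
        self.register_buffer("mask", mask)                 # (d*hidden, d)
        # Transformer tau: applied to (x_i, h_i) for every i in parallel.
        self.tau = nn.Sequential(nn.Linear(hidden + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def conditioner(self, x):                              # x: (batch, d)
        h = F.linear(x, self.cond.weight * self.mask, self.cond.bias)
        return torch.tanh(h).view(x.shape[0], self.d, -1)  # (batch, d, hidden)

    def forward(self, x):
        h = self.conditioner(x)
        inp = torch.cat([x.unsqueeze(-1), h], dim=-1)      # concat x_i with h_i
        return self.tau(inp).squeeze(-1)                   # (batch, d)

net = HollowBlock(d=5)
print(net(torch.randn(8, 5)).shape)                        # torch.Size([8, 5])
```

A single τ is shared across dimensions in this sketch; per-dimension transformers τ_i behave the same way as long as they act element-wise in i.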

Backward: Modified Computation Graph

[Figure: the forward computation graph (edges $x_j \to h_i$ for $j \neq i$, and $x_i, h_i \to f_i$) and the modified backward computation graph, in which the $x \to h$ edges are severed by the stop-gradient.]

Let $\hat{h}$ be stop_gradient($h$) and $\hat{f} = \tau(x, \hat{h})$, so

$\frac{\partial \hat{f}_i(x)}{\partial x_j} = \frac{\partial \tau_i(x_i, \hat{h}_i)}{\partial x_j} = \begin{cases} \frac{\partial f_i(x)}{\partial x_i} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$    (3)

Dimension-wise derivatives.

$\mathbf{1}^T \frac{\partial \hat{f}(x)}{\partial x} = D_{\text{dim}} \hat{f} = \left[ \frac{\partial f_1(x)}{\partial x_1} \;\cdots\; \frac{\partial f_d(x)}{\partial x_d} \right]^T = D_{\text{dim}} f$    (4)

Higher orders.

$\mathbf{1}^T \frac{\partial D^{k-1}_{\text{dim}} \hat{f}(x)}{\partial x} = D^k_{\text{dim}} \hat{f}(x) = D^k_{\text{dim}} f(x)$    (5)

Backpropagating through dim-wise derivatives.

$\frac{\partial D^k_{\text{dim}} \hat{f}}{\partial w} + \frac{\partial D^k_{\text{dim}} \hat{f}}{\partial \hat{h}} \frac{\partial h}{\partial w} = \frac{\partial D^k_{\text{dim}} f}{\partial w}$    (6)

App I: Linear Multistep ODE Solvers

Implicit ODE solvers need to solve an optimization sub-problem in the inner loop.

General idea: replace Newton-Raphson

$y^{(k+1)} = y^{(k)} - \left[ \frac{\partial F(y^{(k)})}{\partial y^{(k)}} \right]^{-1} F(y^{(k)})$    (7)

with Jacobi-Newton

$y^{(k+1)} = y^{(k)} - \left[ D_{\text{dim}} F(y^{(k)}) \right]^{-1} \odot F(y^{(k)})$    (8)
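A minimal sketch of the Jacobi-Newton update (8), assuming PyTorch; F below is a toy residual with weak off-diagonal coupling, and its dimension-wise derivative is written in closed form here rather than computed with the detached backward pass.

```python
import torch

torch.manual_seed(0)
d = 4
A = 0.1 * torch.randn(d, d)                    # weak coupling between dimensions
c = torch.randn(d)

def F(y):                                      # toy residual, F(y*) = 0 at the solution
    return y - torch.tanh(y @ A.T) - c

def ddim_F(y):                                 # diagonal of dF/dy for this toy F
    return 1.0 - (1.0 - torch.tanh(y @ A.T) ** 2) * torch.diag(A)

y = torch.zeros(d)
for _ in range(50):
    y = y - F(y) / ddim_F(y)                   # y <- y - [D_dim F(y)]^{-1} ⊙ F(y)
print(F(y).abs().max())                        # residual should be ~0
```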

[Plot: number of function evaluations vs. training iteration for the RK4(5), ABM, and ABM-Jacobi solvers.]

App II: Continuous Normalizing Flows

If $\frac{dx}{dt} = f(t, x)$, then $\frac{\partial \log p(x)}{\partial t} = -\operatorname{tr}\!\left( \frac{\partial f}{\partial x} \right)$. (See “Neural ODEs”.)

Model        MNIST ELBO ↑   MNIST NLL ↓   Omniglot ELBO ↑   Omniglot NLL ↓
VAE          -86.55         82.14         -104.28           97.25
Planar       -86.06         81.91         -102.65           96.04
IAF          -84.20         80.79         -102.41           96.08
Sylvester    -83.32         80.22         -99.00            93.77
FFJORD       -82.82         −             -98.33            −
DiffOp-CNF   -82.37         80.22         -97.42            93.90

Table: Evidence lower bound (ELBO) and negative log-likelihood (NLL) for static MNIST and Omniglot.

Exact trace vs. stochastic trace: using the exact trace converges faster and results in easier-to-solve dynamical systems.
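The exact trace term comes from the same detached backward pass. Below is a minimal sketch of the joint (x, log p) dynamics, assuming PyTorch, with a tiny hollow toy function standing in for a learned model (names are illustrative, and the toy dynamics ignore t).

```python
import torch

d = 3
W = 0.5 * torch.randn(d, d) * (1.0 - torch.eye(d))   # hollow coupling: zero diagonal

def conditioner(x):                                   # h_i depends only on x_{-i}
    return torch.tanh(x @ W.T)

def transformer(x, h):                                # f_i = tau_i(x_i, h_i)
    return torch.tanh(x + h)

def cnf_dynamics(t, x):
    """Returns dx/dt = f and d(log p)/dt = -tr(df/dx) = -1^T D_dim f."""
    x = x.detach().requires_grad_(True)
    h = conditioner(x)
    f = transformer(x, h.detach())                    # detach -> one reverse pass
    ddim = torch.autograd.grad(f.sum(), x, create_graph=True)[0]
    return f, -ddim.sum()                             # exact trace (hollow Jacobian)

dx, dlogp = cnf_dynamics(torch.tensor(0.0), torch.randn(d))
```

A full CNF would feed these two derivatives to an ODE solver (e.g. odeint from the torchdiffeq package) and make the dynamics depend on t as well.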

App III: Stochastic Differential Eqs

$dx(t) = f(x(t), t)\,dt + g(x(t), t)\,dW$    (9)

[Figure: data samples and the learned density.]

Idea: match the left- and right-hand sides of the Fokker-Planck equation, which describes the change in density:

$\frac{\partial p(t, x)}{\partial t} = -\sum_{i=1}^d \frac{\partial}{\partial x_i}\left[ f_i(t, x)\, p(t, x) \right] + \frac{1}{2} \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}\left[ g^2_{ii}(t, x)\, p(t, x) \right]$    (10)
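As a sketch of this matching objective, here is a naive autograd version of the residual between the two sides of (10) for small d, assuming PyTorch; log_p, f, and g_diag are toy placeholders, and the per-dimension loops are exactly the cost that the D_dim trick on hollow parameterizations avoids.

```python
import torch

d = 2

def log_p(t, x):                    # toy time-dependent log-density
    return -0.5 * ((x - torch.tanh(t)) ** 2).sum()

def f(t, x):                        # toy drift
    return -x + torch.sin(t)

def g_diag(t, x):                   # toy diagonal diffusion g_ii
    return 0.3 + 0.1 * torch.sigmoid(x)

def fokker_planck_residual(t, x):
    t = t.detach().requires_grad_(True)
    x = x.detach().requires_grad_(True)
    p = log_p(t, x).exp()

    dpdt = torch.autograd.grad(p, t, create_graph=True)[0]          # LHS of (10)

    drift_term = f(t, x) * p                                        # f_i p
    diff_term = g_diag(t, x) ** 2 * p                               # g_ii^2 p
    rhs = torch.zeros(())
    for i in range(d):                                              # naive per-dim loops
        d_drift = torch.autograd.grad(drift_term[i], x, create_graph=True)[0][i]
        d_diff = torch.autograd.grad(diff_term[i], x, create_graph=True)[0]
        d2_diff = torch.autograd.grad(d_diff[i], x, create_graph=True)[0][i]
        rhs = rhs - d_drift + 0.5 * d2_diff
    return (dpdt - rhs) ** 2        # squared mismatch, to be minimized in training

loss = fokker_planck_residual(torch.tensor(0.1), torch.randn(d))
```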

References

• Germain et al. “MADE: Masked autoencoder for distribution estimation.” (2015)

• Huang et al. “Neural Autoregressive Flows.” (2018)

• Chen et al. “Neural ODEs.” (2018)