Neural Networks with Cheap Differential Operators
Ricky T. Q. Chen and David Duvenaud

University of Toronto, Vector Institute

Overview

Given a function $f : \mathbb{R}^d \to \mathbb{R}^d$, we seek to obtain a vector containing its dimension-wise $k$-th order derivatives,

$D^k_{\text{dim}} f(x) := \left[ \frac{\partial^k f_1(x)}{\partial x_1^k} \;\cdots\; \frac{\partial^k f_d(x)}{\partial x_d^k} \right]^T \in \mathbb{R}^d$    (1)

using only $k$ evaluations of automatic differentiation, regardless of the dimension $d$.

[Figure: the full Jacobian of $f(x)$ vs. $D_{\text{dim}} f(x)$, the diagonal of the Jacobian.]

This has applications in:

I. Solving differential equations.

II. Continuous Normalizing Flows.

III. Learning stochastic differential equations.

Reverse-mode Automatic Differentiation

Deep learning software relies on automatic differentiation (AD) to compute gradients. More generally, reverse-mode AD computes vector-Jacobian products:

$v^T \frac{\partial f(x)}{\partial x} = \sum_{i=1}^d v_i \frac{\partial f_i(x)}{\partial x}$    (2)

However, computing the Jacobian trace (i.e., the sum of $D_{\text{dim}} f$) is as expensive as the full Jacobian: one evaluation of AD can only compute one row of the Jacobian.
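To make the cost concrete, here is a minimal sketch (PyTorch is assumed, since the poster names no framework, and the toy map f is purely illustrative) of what one reverse-mode call buys, a single vector-Jacobian product, versus the d calls a naive Jacobian diagonal requires.

```python
import torch

d = 4
W = torch.randn(d, d)
f = lambda x: torch.tanh(x @ W)   # arbitrary map R^d -> R^d (a stand-in)

x = torch.randn(d, requires_grad=True)
y = f(x)

# One reverse-mode call computes one vector-Jacobian product v^T (df/dx).
v = torch.randn(d)
vjp = torch.autograd.grad(y, x, grad_outputs=v, retain_graph=True)[0]

# Recovering just the diagonal (or the trace) naively still takes d reverse
# passes, one basis vector per row of the Jacobian.
rows = [torch.autograd.grad(y, x, grad_outputs=torch.eye(d)[i], retain_graph=True)[0]
        for i in range(d)]
naive_diag = torch.stack(rows).diagonal()
```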

Forward: Network Structure

Our approach: dimension-wise derivatives can be efficiently computed for restricted architectures with simple modifications to the AD procedure in the backward pass.

Conditioner: $h_i = c_i(x_{-i})$. The $i$-th hidden dimension depends on all inputs except the $i$-th input dimension. Can be computed in parallel using masked neural networks (e.g. MADE, PixelCNN).

Transformer: $f_i(x) = \tau_i(x_i, h_i)$, where $\tau_i : \mathbb{R}^d \to \mathbb{R}$ outputs the $i$-th dimension given the concatenated vector. Can be computed in parallel if composed of matrix multiplications and element-wise operations.
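Below is a minimal sketch of this conditioner/transformer structure, assuming PyTorch. The masked linear layer is a simple stand-in for a MADE/PixelCNN-style conditioner, and the class and method names (HollowBlock, conditioner, tau) are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HollowBlock(nn.Module):
    """f_i(x) = tau_i(x_i, h_i) with h_i = c_i(x_{-i})  (toy sketch)."""

    def __init__(self, d, hidden=64):
        super().__init__()
        self.d = d
        # Conditioner c_i(x_{-i}): a linear map masked so that hidden
        # block i never sees input dimension i.
        self.cond = nn.Linear(d, d * hidden)
        mask = (1.0 - torch.eye(d)).repeat_interleave(hidden, dim=0)
        self.register_buffer("mask", mask)                 # (d*hidden, d)
        # Transformer tau: applied to (x_i, h_i) for every i in parallel.
        self.tau = nn.Sequential(nn.Linear(hidden + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def conditioner(self, x):                              # x: (batch, d)
        h = F.linear(x, self.cond.weight * self.mask, self.cond.bias)
        return torch.tanh(h).view(x.shape[0], self.d, -1)  # (batch, d, hidden)

    def forward(self, x):
        h = self.conditioner(x)
        inp = torch.cat([x.unsqueeze(-1), h], dim=-1)      # concat x_i with h_i
        return self.tau(inp).squeeze(-1)                   # (batch, d)

net = HollowBlock(d=5)
print(net(torch.randn(8, 5)).shape)                        # torch.Size([8, 5])
```

A single τ is shared across dimensions in this sketch; per-dimension transformers τ_i behave the same way as long as they act element-wise in i.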

Backward: Modified Computation Graph

[Figure: the forward computation graph (edges $x_j \to h_i$ for $j \neq i$, and $x_i, h_i \to f_i$) and the modified backward computation graph, in which the $x \to h$ edges are severed by the stop-gradient.]

Let $\hat{h}$ be stop_gradient($h$) and $\hat{f} = \tau(x, \hat{h})$, so

$\frac{\partial \hat{f}_i(x)}{\partial x_j} = \frac{\partial \tau_i(x_i, \hat{h}_i)}{\partial x_j} = \begin{cases} \frac{\partial f_i(x)}{\partial x_i} & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$    (3)

Dimension-wise derivatives.

$\mathbf{1}^T \frac{\partial \hat{f}(x)}{\partial x} = D_{\text{dim}} \hat{f} = \left[ \frac{\partial f_1(x)}{\partial x_1} \;\cdots\; \frac{\partial f_d(x)}{\partial x_d} \right]^T = D_{\text{dim}} f$    (4)

Higher orders.

$\mathbf{1}^T \frac{\partial D^{k-1}_{\text{dim}} \hat{f}(x)}{\partial x} = D^k_{\text{dim}} \hat{f}(x) = D^k_{\text{dim}} f(x)$    (5)

Backpropagating through dim-wise derivatives.

$\frac{\partial D^k_{\text{dim}} \hat{f}}{\partial w} + \frac{\partial D^k_{\text{dim}} \hat{f}}{\partial \hat{h}} \frac{\partial h}{\partial w} = \frac{\partial D^k_{\text{dim}} f}{\partial w}$    (6)

App I: Linear Multistep ODE Solvers

Implicit ODE solvers need to solve an optimization sub-problem in the inner loop.

General idea: replace Newton-Raphson

$y^{(k+1)} = y^{(k)} - \left[ \frac{\partial F(y^{(k)})}{\partial y^{(k)}} \right]^{-1} F(y^{(k)})$    (7)

with Jacobi-Newton

$y^{(k+1)} = y^{(k)} - \left[ D_{\text{dim}} F(y^{(k)}) \right]^{-1} \odot F(y^{(k)})$    (8)
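A minimal sketch of the Jacobi-Newton update (8), assuming PyTorch; F below is a toy residual with weak off-diagonal coupling, and its dimension-wise derivative is written in closed form here rather than computed with the detached backward pass.

```python
import torch

torch.manual_seed(0)
d = 4
A = 0.1 * torch.randn(d, d)                    # weak coupling between dimensions
c = torch.randn(d)

def F(y):                                      # toy residual, F(y*) = 0 at the solution
    return y - torch.tanh(y @ A.T) - c

def ddim_F(y):                                 # diagonal of dF/dy for this toy F
    return 1.0 - (1.0 - torch.tanh(y @ A.T) ** 2) * torch.diag(A)

y = torch.zeros(d)
for _ in range(50):
    y = y - F(y) / ddim_F(y)                   # y <- y - [D_dim F(y)]^{-1} ⊙ F(y)
print(F(y).abs().max())                        # residual should be ~0
```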

[Plot: number of function evaluations vs. training iteration for the RK4(5), ABM, and ABM-Jacobi solvers.]

App II: Continuous Normalizing Flows

If $\frac{dx}{dt} = f(t, x)$, then $\frac{\partial \log p(x)}{\partial t} = -\operatorname{tr}\!\left( \frac{\partial f}{\partial x} \right)$. (See “Neural ODEs”.)

Model        MNIST ELBO ↑   MNIST NLL ↓   Omniglot ELBO ↑   Omniglot NLL ↓
VAE          -86.55         82.14         -104.28           97.25
Planar       -86.06         81.91         -102.65           96.04
IAF          -84.20         80.79         -102.41           96.08
Sylvester    -83.32         80.22         -99.00            93.77
FFJORD       -82.82         −             -98.33            −
DiffOp-CNF   -82.37         80.22         -97.42            93.90

Table: Evidence lower bound (ELBO) and negative log-likelihood (NLL) for static MNIST and Omniglot.

Exact trace vs. stochastic trace: using the exact trace converges faster and results in easier-to-solve dynamical systems.
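The exact trace term comes from the same detached backward pass. Below is a minimal sketch of the joint (x, log p) dynamics, assuming PyTorch, with a tiny hollow toy function standing in for a learned model (names are illustrative, and the toy dynamics ignore t).

```python
import torch

d = 3
W = 0.5 * torch.randn(d, d) * (1.0 - torch.eye(d))   # hollow coupling: zero diagonal

def conditioner(x):                                   # h_i depends only on x_{-i}
    return torch.tanh(x @ W.T)

def transformer(x, h):                                # f_i = tau_i(x_i, h_i)
    return torch.tanh(x + h)

def cnf_dynamics(t, x):
    """Returns dx/dt = f and d(log p)/dt = -tr(df/dx) = -1^T D_dim f."""
    x = x.detach().requires_grad_(True)
    h = conditioner(x)
    f = transformer(x, h.detach())                    # detach -> one reverse pass
    ddim = torch.autograd.grad(f.sum(), x, create_graph=True)[0]
    return f, -ddim.sum()                             # exact trace (hollow Jacobian)

dx, dlogp = cnf_dynamics(torch.tensor(0.0), torch.randn(d))
```

A full CNF would feed these two derivatives to an ODE solver (e.g. odeint from the torchdiffeq package) and make the dynamics depend on t as well.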

App III: Stochastic Differential Eqs

$dx(t) = f(x(t), t)\,dt + g(x(t), t)\,dW$    (9)

[Figure: data samples and the learned density.]

Idea: match the left- and right-hand sides of the Fokker-Planck equation, which describes the change in density:

$\frac{\partial p(t, x)}{\partial t} = -\sum_{i=1}^d \frac{\partial}{\partial x_i}\left[ f_i(t, x)\, p(t, x) \right] + \frac{1}{2} \sum_{i=1}^d \frac{\partial^2}{\partial x_i^2}\left[ g^2_{ii}(t, x)\, p(t, x) \right]$    (10)
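As a sketch of this matching objective, here is a naive autograd version of the residual between the two sides of (10) for small d, assuming PyTorch; log_p, f, and g_diag are toy placeholders, and the per-dimension loops are exactly the cost that the D_dim trick on hollow parameterizations avoids.

```python
import torch

d = 2

def log_p(t, x):                    # toy time-dependent log-density
    return -0.5 * ((x - torch.tanh(t)) ** 2).sum()

def f(t, x):                        # toy drift
    return -x + torch.sin(t)

def g_diag(t, x):                   # toy diagonal diffusion g_ii
    return 0.3 + 0.1 * torch.sigmoid(x)

def fokker_planck_residual(t, x):
    t = t.detach().requires_grad_(True)
    x = x.detach().requires_grad_(True)
    p = log_p(t, x).exp()

    dpdt = torch.autograd.grad(p, t, create_graph=True)[0]          # LHS of (10)

    drift_term = f(t, x) * p                                        # f_i p
    diff_term = g_diag(t, x) ** 2 * p                               # g_ii^2 p
    rhs = torch.zeros(())
    for i in range(d):                                              # naive per-dim loops
        d_drift = torch.autograd.grad(drift_term[i], x, create_graph=True)[0][i]
        d_diff = torch.autograd.grad(diff_term[i], x, create_graph=True)[0]
        d2_diff = torch.autograd.grad(d_diff[i], x, create_graph=True)[0][i]
        rhs = rhs - d_drift + 0.5 * d2_diff
    return (dpdt - rhs) ** 2        # squared mismatch, to be minimized in training

loss = fokker_planck_residual(torch.tensor(0.1), torch.randn(d))
```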

References

• Germain et al. “MADE: Masked autoencoder for distribution estimation.” (2015)

• Huang et al. “Neural Autoregressive Flows.” (2018)

• Chen et al. “Neural ODEs.” (2018)