2
Outline
This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, Jacobian, etc.)
- How do we compute derivatives
- Automatic differentiation
Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)
Derivatives and machine learning
3
4
Derivatives in machine learning
“Backprop” and gradient descent are at the core of all recent advances:
- Computer vision [figures: NVIDIA DRIVE PX 2 segmentation; Top-5 error rate for ImageNet (NVIDIA devblog); Faster R-CNN (Ren et al., 2015)]
- Speech recognition/synthesis [figure: word error rates (Huang et al., 2014)]
- Machine translation [figure: Google Neural Machine Translation system (GNMT)]
5
Derivatives in machine learning
“Backprop” and gradient descent are at the core of all recent advances:
Probabilistic programming (and modeling)
- Tools: Edward (2016), Pyro (2017), ProbTorch (2017), TensorFlow Probability (2018)
- Variational inference
- “Neural” density estimation
- Transformed distributions via bijectors
- Normalizing flows (Rezende & Mohamed, 2015)
- Masked autoregressive flows (Papamakarios et al., 2017)
6
Derivatives in machine learning
At the core of all of these: differentiable functions (programs) whose parameters are tuned by gradient-based optimization
(Ruder, 2017) http://ruder.io/optimizing-gradient-descent/
7
Automatic differentiation
Execute differentiable functions (programs) via automatic differentiation
A word on naming:
- Differentiable programming, a generalization of deep learning (Olah, LeCun): “Neural networks are just a class of differentiable functions”
- Automatic differentiation
- Algorithmic differentiation
- AD
- Autodiff
- Algodiff
- Autograd
Also remember:
- Backprop
- Backpropagation (backward propagation of errors)
Essential concepts (refresher)
8
9
Derivative
Function of a real variable: a dependent variable y as a function of an independent variable x
Sensitivity of the function value w.r.t. a change in its argument (the instantaneous rate of change)
Newton, c. 1665; Leibniz, c. 1675
Common notations: Leibniz dy/dx, Lagrange f′(x), Newton ẏ
10
Derivative
Function of a real variable
Newton, c. 1665; Leibniz, c. 1675
Differentiation is mechanical: a small set of rules (sum, product, quotient, chain rule, derivatives of elementary functions, ...), around 15 such rules in total
Note: the derivative is a linear operator, a.k.a. a higher-order function in programming languages
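To make the “higher-order function” remark concrete, here is a minimal sketch (illustrative only; it uses a central finite difference, and the helper name d is arbitrary): d takes a function and returns another function that approximates its derivative.

from typing import Callable

def d(f: Callable[[float], float], h: float = 1e-6) -> Callable[[float], float]:
    """Higher-order function: maps a function f to (an approximation of) its derivative f'."""
    return lambda x: (f(x + h) - f(x - h)) / (2.0 * h)

square = lambda x: x * x
dsquare = d(square)   # the derivative is itself a function we can call or pass around
print(dsquare(3.0))   # approximately 6.0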
11
Partial derivative
Function of several real variables
A derivative w.r.t. one independent variable, with the others held constant
Written with the symbol ∂, read “del” (or “partial”)
12
Partial derivative
Function of several real variables
The gradient of f: ℝⁿ → ℝ is the vector of all partial derivatives:
∇f = (∂f/∂x_1, ..., ∂f/∂x_n)
∇ is read “nabla” or “del”
Nabla is the higher-order function mapping f to ∇f
∇f points in the direction with the largest rate of change
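As a quick illustration, for f(x_1, x_2) = x_1·x_2 + log(x_2) (an example chosen to match the second output of the function used later in the lecture), the gradient is ∇f(x_1, x_2) = (∂f/∂x_1, ∂f/∂x_2) = (x_2, x_1 + 1/x_2).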
13
Total derivative
Function of several real variables
The derivative w.r.t. all variables (independent and dependent)
Consider all partial derivatives simultaneously and accumulate all direct and indirect contributions (important: this will be useful later)
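For example, if y = f(x_1, x_2) and both x_1 and x_2 themselves depend on a variable t, the total derivative of y w.r.t. t accumulates both contributions: dy/dt = (∂f/∂x_1)·(dx_1/dt) + (∂f/∂x_2)·(dx_2/dt).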
14
Matrix calculus and machine learning
Extension to multivariable functions:
- scalar input, scalar output: f: ℝ → ℝ (the ordinary derivative)
- vector input, scalar output: f: ℝⁿ → ℝ (a scalar field; its derivative is the gradient)
- scalar input, vector output: f: ℝ → ℝᵐ
- vector input, vector output: f: ℝⁿ → ℝᵐ (a vector field; its derivative is the Jacobian)
In machine learning, we construct (deep) compositions of
- f: ℝⁿ → ℝᵐ, e.g., a neural network
- f: ℝⁿ → ℝ, e.g., a loss function, KL divergence, or log joint probability
15
Matrix calculus and machine learning
Generalization to tensors (multi-dimensional arrays) for efficient batching, handling of sequences, channels in convolutions, etc.
And many, many more rules
16
Matrix calculus and machine learning
Finally, two constructs relevant to machine learning: the Jacobian and the Hessian
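For reference, the standard definitions: for f: ℝⁿ → ℝᵐ, the Jacobian J_f is the m × n matrix of first-order partial derivatives, (J_f)_ij = ∂f_i/∂x_j; for a scalar-valued g: ℝⁿ → ℝ, the Hessian H_g is the n × n matrix of second-order partial derivatives, (H_g)_ij = ∂²g/(∂x_i ∂x_j).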
How to compute derivatives
17
18
Derivatives as code
We can compute the derivatives not just of mathematical functions, but of general programs (with control flow)
19
Derivatives as code
20
Manual differentiation
Analytic derivatives are needed for theoretical insight:
- analytic solutions, proofs
- mathematical analysis, e.g., stability of fixed points
They are unnecessary when we just need derivative evaluations for optimization.
You can see papers like this:
21
Symbolic differentiation
Symbolic computation with Mathematica, Maple, Maxima, and deep learning frameworks such as Theano
Problem: expression swell
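A small illustration of expression swell (an illustrative sketch using SymPy; none of the tools named above are implied): compose the logistic map a few times and differentiate the result symbolically; the derivative expression grows much faster than the primal expression.

import sympy as sp

x = sp.symbols('x')
expr = x
for _ in range(4):                 # iterate the logistic map: l -> 4*l*(1 - l)
    expr = 4 * expr * (1 - expr)

deriv = sp.diff(expr, x)           # symbolic derivative of the iterated expression
print(sp.count_ops(expr), sp.count_ops(deriv))   # the derivative has far more operations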
22
Symbolic differentiation
Graph optimization (e.g., in Theano)
Symbolic differentiation
Symbolic graph builders such as Theano and TensorFlow have limited, unintuitive control flow, loops, and recursion
Problem: only applicable to closed-form mathematical functions
You can find the derivative of a closed-form expression, but not of a general program (e.g., one with control flow)
24
Numerical differentiation
Finite difference approximation of ∂f/∂x_i for f: ℝⁿ → ℝ:
∂f(x)/∂x_i ≈ (f(x + h·e_i) − f(x)) / h, for a small step size h > 0
Problem: we must select h, and we face approximation errors
Problem: f needs to be evaluated O(n) times, once with each standard basis vector e_i
25
Numerical differentiation
Finite difference approximation of ∂f/∂x_i for f: ℝⁿ → ℝ
Better approximations exist:
- Higher-order finite differences, e.g., the center difference: ∂f(x)/∂x_i ≈ (f(x + h·e_i) − f(x − h·e_i)) / (2h)
- Richardson extrapolation
- Differential quadrature
These increase rapidly in complexity and never completely eliminate the error
26
Numerical differentiation
Still extremely useful as a quick check of our gradient implementations (“gradient checking”), as sketched below
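A minimal sketch of such a gradient check (an illustrative example; the test function, step size, and tolerance are arbitrary choices): compare a hand-derived (or AD-produced) gradient against central differences.

import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Central-difference approximation of the gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# Example: f(x) = x1*x2 + log(x2), with its hand-derived gradient (x2, x1 + 1/x2).
f = lambda x: x[0] * x[1] + np.log(x[1])
analytic_grad = lambda x: np.array([x[1], x[0] + 1.0 / x[1]])

x = np.array([2.0, 3.0])
print(np.allclose(numerical_grad(f, x), analytic_grad(x), atol=1e-6))  # True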
27
If we don’t need analytic derivative expressions, we can evaluate a gradient exactly with only one forward and one reverse execution
In machine learning, this is known as backpropagation or “backprop”
- Automatic differentiation is more than backprop
- Or, backprop is a specialized reverse mode automatic differentiation
- We will come back to this shortly
Nature 323, 533–536 (9 October 1986)
Automatic differentiation
Backprop or automatic differentiation?
28
29
1960s, 1970s, 1980s
Precursors: Kelley, 1960; Bryson, 1961; Pontryagin et al., 1961; Dreyfus, 1962
Wengert, 1964: forward mode
Linnainmaa, 1970, 1976: backpropagation
Dreyfus, 1973: control parameters
Werbos, 1974: reverse mode
Speelpenning, 1980: automatic reverse mode
Werbos, 1982: first NN-specific backprop
Parker, 1985
LeCun, 1985
Rumelhart, Hinton & Williams, 1986: revived backprop
Griewank, 1989: revived reverse mode
30
Recommended reading:
Griewank, A., 2012. Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, Extra Volume ISMP, pp.389-400.
Schmidhuber, J., 2015. Who Invented Backpropagation? http://people.idsia.ch/~juergen/who-invented-backpropagation.html
Automatic differentiation
31
32
Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives
- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies
34
Automatic differentiation
All numerical algorithms, when executed, evaluate to compositions of a finite set of elementary operations with known derivatives
- Called a trace or a Wengert list (Wengert, 1964)
- Alternatively represented as a computational graph showing dependencies
def f(a, b):
    c = a * b
    d = log(c)
    return d
[Computational graph: a, b → (*) → c → (log) → d]
35
Automatic differentiation
def f(a, b):
    c = a * b
    d = log(c)
    return d

1.791 = f(2, 3)                 primal evaluation
[0.5, 0.333] = f’(2, 3)         derivative (tangent or adjoint; for scalar output, the “gradient”)

[Annotated graph: primals a = 2, b = 3, c = 6, d = 1.791; derivatives 0.5 at a, 0.333 at b, 0.166 at c, 1 at d]
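As a sketch of what recording such a trace can look like in code (an illustrative scheme, not any particular framework's API): each primitive operation appends an entry to a list as the program runs.

import math

tape = []  # the Wengert list: one entry per elementary operation

def mul(a, b):
    out = a * b
    tape.append(("mul", (a, b), out))
    return out

def log(a):
    out = math.log(a)
    tape.append(("log", (a,), out))
    return out

def f(a, b):
    c = mul(a, b)
    d = log(c)
    return d

print(f(2.0, 3.0))   # 1.791...
print(tape)          # [('mul', (2.0, 3.0), 6.0), ('log', (6.0,), 1.791...)]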
38
Automatic differentiation
Two main flavors:
- Forward mode: primals and derivatives (tangents) both propagate from the independent to the dependent variables
- Reverse mode (a.k.a. backprop): primals propagate forward, derivatives (adjoints) propagate from the dependent variables back to the independent ones
Nested combinations give higher-order derivatives, Hessian–vector products, etc.:
- Forward-on-reverse
- Reverse-on-forward
- ...
39
What happens to control flow?
def f(a, b):
    c = a * b
    if c > 0:
        d = log(c)
    else:
        d = sin(c)
    return d
It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution
40
What happens to control flow?
f(a = 2, b = 3):
    c = a * b = 6
    if c > 0:
        d = log(c) = 1.791      (this branch is taken)
    else:
        d = sin(c)
    return d
[Trace/graph: a = 2, b = 3 → (*) → c = 6 → (log) → d = 1.791]
It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution
41
What happens to control flow?
f(a = 2, b = -1):
    c = a * b = -2
    if c > 0:
        d = log(c)
    else:
        d = sin(c) = -0.909     (this branch is taken)
    return d
[Trace/graph: a = 2, b = -1 → (*) → c = -2 → (sin) → d = -0.909]
It disappears: branches are taken, loops are unrolled, functions are inlined, etc., until we are left with the linear trace of execution
42
The resulting trace is a directed acyclic graph (DAG), and its evaluation follows a topological ordering of the nodes.
43
Forward mode
def f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)
[Computational graph: x1, x2 → (*) → v1 → (sin) → y1; x2 → (log) → v2; v1, v2 → (+) → y2]
Primals: independent → dependent. Derivatives (tangents): independent → dependent.
Worked example: evaluate f(2, 3) and propagate a tangent alongside each primal, seeding the input tangents with (1, 0).

Primals at (x1, x2) = (2, 3):
    v1 = x1 * x2  = 6
    v2 = log(x2)  = 1.098
    y1 = sin(v1)  = -0.279
    y2 = v1 + v2  = 7.098

Tangents, seeded with tangent(x1) = 1, tangent(x2) = 0:
    tangent(v1) = tangent(x1)·x2 + x1·tangent(x2) = 3
    tangent(v2) = tangent(x2)/x2                  = 0
    tangent(y1) = cos(v1)·tangent(v1)             = 2.880
    tangent(y2) = tangent(v1) + tangent(v2)       = 3

Primals and tangents are computed together in a single forward sweep, from the independent to the dependent variables.
57
Forward mode
In general, forward mode evaluates a Jacobian–vector product: for y = f(x), the tangents satisfy tangent(y) = J_f · tangent(x).
So with the seed tangent(x) = (1, 0) we evaluated the first column of the Jacobian, (∂y1/∂x1, ∂y2/∂x1) = (2.880, 3).
The seed tangent(x) can be any vector, not only a unit (standard basis) vector; for a general direction r, J_f · r is a directional derivative.
Primals: independent → dependent. Derivatives (tangents): independent → dependent.
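A common way to implement forward mode is with dual numbers that carry a (primal, tangent) pair through every elementary operation. Below is a minimal sketch covering only the operations used in this example (an illustrative implementation, not any particular library):

import math

class Dual:
    """A (primal, tangent) pair propagated through elementary operations."""
    def __init__(self, primal, tangent=0.0):
        self.primal, self.tangent = primal, tangent

    def __mul__(self, other):          # product rule
        return Dual(self.primal * other.primal,
                    self.tangent * other.primal + self.primal * other.tangent)

    def __add__(self, other):          # sum rule
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

def dlog(x):                           # d/dx log(x) = 1/x
    return Dual(math.log(x.primal), x.tangent / x.primal)

def dsin(x):                           # d/dx sin(x) = cos(x)
    return Dual(math.sin(x.primal), math.cos(x.primal) * x.tangent)

def f(x1, x2):
    v1 = x1 * x2
    v2 = dlog(x2)
    return dsin(v1), v1 + v2

# Seed tangent(x1) = 1, tangent(x2) = 0 to get the first column of the Jacobian.
y1, y2 = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(y1.primal, y2.primal)    # -0.279..., 7.098...
print(y1.tangent, y2.tangent)  # 2.880..., 3.0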
60
Reverse mode
def f(x1, x2):
    v1 = x1 * x2
    v2 = log(x2)
    y1 = sin(v1)
    y2 = v1 + v2
    return (y1, y2)
[Computational graph: x1, x2 → (*) → v1 → (sin) → y1; x2 → (log) → v2; v1, v2 → (+) → y2]
Primals: independent → dependent. Derivatives (adjoints): dependent → independent.
Worked example: a forward sweep first evaluates the primals of f(2, 3); a reverse sweep then propagates adjoints from the outputs back to the inputs, seeded with adjoint(y1) = 1, adjoint(y2) = 0.

Primals at (x1, x2) = (2, 3):
    v1 = 6, v2 = 1.098, y1 = -0.279, y2 = 7.098

Adjoints, seeded with adjoint(y1) = 1, adjoint(y2) = 0:
    adjoint(v1) = adjoint(y1)·cos(v1) + adjoint(y2)·1 = 0.960
    adjoint(v2) = adjoint(y2)·1                       = 0
    adjoint(x1) = adjoint(v1)·x2                      = 2.880
    adjoint(x2) = adjoint(v1)·x1 + adjoint(v2)/x2     = 1.920
79
Reverse mode
In general, reverse mode evaluates a transposed Jacobian–vector product: adjoint(x) = J_fᵀ · adjoint(y).
So with the seed adjoint(y) = (1, 0) we evaluated the first row of the Jacobian, (∂y1/∂x1, ∂y1/∂x2) = (2.880, 1.920). For a scalar-valued f (a single output) with seed 1, this is the gradient ∇f.
Primals: independent → dependent. Derivatives (adjoints): dependent → independent.
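A minimal sketch of reverse mode for the same example (an illustrative tape/graph-based implementation, not any particular framework): the forward sweep records, for each value, its parents and the local derivatives; the reverse sweep visits the recorded nodes in reverse topological order and accumulates adjoints.

import math

class Var:
    """A value in the trace: parents are (parent, local_derivative) pairs recorded on the forward sweep."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.adjoint = value, list(parents), 0.0

def mul(a, b): return Var(a.value * b.value, [(a, b.value), (b, a.value)])
def add(a, b): return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])
def log(a):    return Var(math.log(a.value), [(a, 1.0 / a.value)])
def sin(a):    return Var(math.sin(a.value), [(a, math.cos(a.value))])

def backward(y, seed=1.0):
    """Reverse sweep: visit nodes in reverse topological order, accumulating adjoints."""
    order, seen = [], set()
    def topo(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                topo(parent)
            order.append(node)
    topo(y)
    y.adjoint = seed
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.adjoint += node.adjoint * local_grad

x1, x2 = Var(2.0), Var(3.0)
v1 = mul(x1, x2)
y1, y2 = sin(v1), add(v1, log(x2))   # forward sweep builds the graph
backward(y1)                         # seed adjoint(y1) = 1, adjoint(y2) = 0
print(x1.adjoint, x2.adjoint)        # 2.880..., 1.920...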
81
Forward vs reverse summary
In the extreme f: ℝ → ℝᵐ (one input, many outputs), use forward mode: a single forward pass evaluates the whole Jacobian (one column).
In the extreme f: ℝⁿ → ℝ (many inputs, one output), use reverse mode: a single reverse pass evaluates the whole gradient (one row).
In general, for f: ℝⁿ → ℝᵐ the Jacobian can be evaluated in
- n passes with forward mode (one per input)
- m passes with reverse mode (one per output)
Reverse mode performs better when n ≫ m, the typical machine learning setting (many parameters, a scalar loss).
Backprop through normal PDF
82
83
Backprop through normal PDF
[Computational graph of the normal PDF f(x; µ, σ) = exp(−(x − µ)² / (2σ²)) / (σ·sqrt(2π)), decomposed into elementary operations (subtract, square, multiply by 0.5, divide, negate, exp, sqrt, reciprocal, multiply), with adjoints propagated from the output f back to the inputs x, µ, and σ]
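To make the picture concrete, here is an illustrative manual forward/reverse pass through such a decomposition, computing ∂f/∂x and checking it against the closed form (a sketch for the x adjoint only; µ and σ would be handled the same way):

import math

def normal_pdf_and_grad_x(x, mu, sigma):
    """Backprop through the elementary operations of the normal PDF, returning (f, df/dx)."""
    # Forward (primal) sweep
    t1 = x - mu                          # x - µ
    t2 = t1 / sigma                      # (x - µ) / σ
    t3 = t2 ** 2                         # ((x - µ) / σ)²
    t4 = -0.5 * t3
    t5 = math.exp(t4)
    t6 = 1.0 / (sigma * math.sqrt(2 * math.pi))
    f  = t6 * t5

    # Reverse (adjoint) sweep, seeded with adjoint(f) = 1
    f_bar  = 1.0
    t5_bar = f_bar * t6                  # f  = t6 * t5
    t4_bar = t5_bar * t5                 # t5 = exp(t4)
    t3_bar = t4_bar * (-0.5)             # t4 = -0.5 * t3
    t2_bar = t3_bar * 2 * t2             # t3 = t2 squared
    t1_bar = t2_bar / sigma              # t2 = t1 / σ
    x_bar  = t1_bar                      # t1 = x - µ
    return f, x_bar

f, dfdx = normal_pdf_and_grad_x(1.0, 0.0, 2.0)
print(abs(dfdx - f * (-(1.0 - 0.0) / 2.0 ** 2)) < 1e-12)  # matches the closed-form derivative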
Summary
84
85
Summary
This lecture:
- Derivatives in machine learning
- Review of essential concepts (what is a derivative, etc.)
- How do we compute derivatives
- Automatic differentiation
Next lecture:
- Current landscape of tools
- Implementation techniques
- Advanced concepts (higher-order API, checkpointing, etc.)
86
References
Baydin, A.G., Pearlmutter, B.A., Radul, A.A. and Siskind, J.M., 2017. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research (JMLR), 18(153), pp.1-153.
Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “Tricks from Deep Learning.” In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.
Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “DiffSharp: An AD Library for .NET Languages.” In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016.
Baydin, Atılım Güneş, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. 2018. “Online Learning Rate Adaptation with Hypergradient Descent.” In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, April 30 – May 3, 2018.
Griewank, A. and Walther, A., 2008. Evaluating derivatives: principles and techniques of algorithmic differentiation (Vol. 105). SIAM.
Nocedal, J. and Wright, S.J., 1999. Numerical Optimization. Springer.
Extra slides
87
88
Forward mode
Primals: independent → dependent. Derivatives (tangents): independent → dependent.
89
Forward mode
def f(a, b):
    c = a * b
    d = log(c)
    return d
[Computational graph: a, b → (*) → c → (log) → d]
Primals: independent → dependent. Derivatives (tangents): independent → dependent.
Worked example: evaluate f(2, 3), propagating a tangent alongside each primal, with seeds tangent(a) = 1, tangent(b) = 0.

Primals:   a = 2, b = 3, c = a·b = 6, d = log(c) = 1.791
Tangents:  tangent(a) = 1, tangent(b) = 0, tangent(c) = tangent(a)·b + a·tangent(b) = 3, tangent(d) = tangent(c)/c = 0.5

In general, forward mode evaluates a Jacobian–vector product. Here we evaluated the partial derivative ∂d/∂a = 0.5 with the seed (tangent(a), tangent(b)) = (1, 0).
100
Reverse mode
def f(a, b):
    c = a * b
    d = log(c)
    return d
[Computational graph: a, b → (*) → c → (log) → d]
Primals: independent → dependent. Derivatives (adjoints): dependent → independent.
Worked example: a forward sweep evaluates the primals of f(2, 3); a reverse sweep then propagates adjoints from the output back to the inputs, seeded with adjoint(d) = 1.

Primals:   a = 2, b = 3, c = a·b = 6, d = log(c) = 1.791
Adjoints:  adjoint(d) = 1, adjoint(c) = adjoint(d)/c = 0.166, adjoint(a) = adjoint(c)·b = 0.5, adjoint(b) = adjoint(c)·a = 0.333

In general, reverse mode evaluates a transposed Jacobian–vector product. Here we evaluated the gradient ∇f(2, 3) = (0.5, 0.333) with the seed adjoint(d) = 1.