Artificial Neural Networks (cse.iitrpr.ac.in/ckn/courses/f2019/cs618/w3.pdf)
Outline
• Feedforward Networks
  • Architecture
  • Activation Functions
• Training Feedforward Networks
  • Gradient Descent
  • Automatic Differentiation
  • Scalar and Vector Mode
  • Backpropagation algorithm
• Heuristics for Training Feedforward Networks
Feedforward Networks CS618 - Artificial Neural Networks 2
Acknowledgements
- Content on activation functions has been adapted from the excellent material of CS231n (Stanford)
- Training the network has been adapted from the class material of 11-785 offered by Bhiksha Raj (CMU)
- Auto Differentiation has been adapted from the STAT 157 course by Alex Smola and Mu Li (Berkeley)
Recap
• Perceptron
  • Perceptron Update Rule
• Multilayer Perceptron
  • MLP is a Universal Function Approximator
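[Diagram: perceptron with inputs $x_0 = 1, x_1, \dots, x_n$ and weights $w_0, w_1, \dots, w_n$ feeding a summation unit $\sum_{i=0}^{n} w_i x_i$.]
$$o = \begin{cases} 1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$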
Machine Learning Model Paradigm
• Generic Architecture
  • Representation
    • How would you like to characterize what is being learned?
  • Evaluation
    • How would you like to measure the goodness of what is being learned?
  • Optimization
    • Given the evaluation and characterization, find the optimum representation.
• Discuss this in the context of an MLP
Linear Regression and Using Nonlinear Basis Functions (1)
• Linear combination of fixed nonlinear basis functions of the input variables
$$y = f\left(w_0 + \sum_{j=1}^{d} w_j \phi_j(\mathbf{x})\right)$$
• Regression
  • $f(\cdot)$ - identity function
• Classification
  • $f(\cdot)$ - nonlinear activation
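As a rough illustration of the model above, the sketch below evaluates a linear combination of fixed basis functions. The Gaussian basis, its centres and width, and all weight values are assumptions chosen for the example, not values from the slides.

```python
import math

def gaussian_basis(x, mu, s=0.5):
    """Gaussian basis function centred at mu with width s (illustrative choice)."""
    return math.exp(-((x - mu) ** 2) / (2 * s ** 2))

def predict(x, w0, weights, centres, f=lambda a: a):
    """y = f(w0 + sum_j w_j * phi_j(x)); f defaults to the identity (regression)."""
    a = w0 + sum(w * gaussian_basis(x, mu) for w, mu in zip(weights, centres))
    return f(a)

centres = [-1.0, 0.0, 1.0]   # hypothetical, fixed in advance
weights = [0.5, -1.0, 2.0]   # hypothetical weights w_1..w_3

# Regression: f is the identity.
y_reg = predict(0.0, 0.1, weights, centres)
# Classification: f is a nonlinear activation (here a sigmoid).
y_cls = predict(0.0, 0.1, weights, centres, f=lambda a: 1 / (1 + math.exp(-a)))
```

Only the weights are adjustable here; the basis functions themselves stay fixed.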
3. LINEAR MODELS FOR REGRESSION (p. 140)
Figure 3.1 Examples of basis functions, showing polynomials on the left, Gaussians of the form (3.4) in the centre, and sigmoidal of the form (3.5) on the right.
on a regular lattice, such as the successive time points in a temporal sequence, or the pixels in an image. Useful texts on wavelets include Ogden (1997), Mallat (1999), and Vidakovic (1999).
Most of the discussion in this chapter, however, is independent of the particular choice of basis function set, and so for most of our discussion we shall not specify the particular form of the basis functions, except for the purposes of numerical illustration. Indeed, much of our discussion will be equally applicable to the situation in which the vector φ(x) of basis functions is simply the identity φ(x) = x. Furthermore, in order to keep the notation simple, we shall focus on the case of a single target variable t. However, in Section 3.1.5, we consider briefly the modifications needed to deal with multiple target variables.
3.1.1 Maximum likelihood and least squares
In Chapter 1, we fitted polynomial functions to data sets by minimizing a sum-of-squares error function. We also showed that this error function could be motivated as the maximum likelihood solution under an assumed Gaussian noise model. Let us return to this discussion and consider the least squares approach, and its relation to maximum likelihood, in more detail.
As before, we assume that the target variable t is given by a deterministic function y(x, w) with additive Gaussian noise so that
$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon \qquad (3.7)$$
where ϵ is a zero mean Gaussian random variable with precision (inverse variance) β. Thus we can write
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right). \qquad (3.8)$$
Recall that, if we assume a squared loss function, then the optimal prediction, for a new value of x, will be given by the conditional mean of the target variable (Section 1.5.5). In the case of a Gaussian conditional distribution of the form (3.8), the conditional mean
Linear Regression and Using Nonlinear Basis Functions (2)
• The nonlinear basis functions were predefined
• Extend it such that
  • $\phi_j$ depend on parameters $w_j$
  • The parameters $w_j$ should be learned.
• Feedforward networks are one method to construct these parametric nonlinear basis functions.
Basic Feedforward Network Model (1)
• Key difference from Multilayer Perceptron - nonlinear activation function
[Diagram: a single unit with inputs $x_0 = 1, x_1, \dots, x_n$ and weights $w_0, w_1, \dots, w_n$.]
$$\text{net} = \sum_{i=0}^{n} w_i x_i, \qquad o = \sigma(\text{net}) = \frac{1}{1 + e^{-\text{net}}}$$
Basic Feedforward Network Model (2)
• Input - $\mathbf{x}$
• Connection weights between input and hidden layer – $w_{jh}$
• Output of the $h^{th}$ node in the hidden layer - $z_h$
$$z_h = f_1\left(\sum_{j=0}^{d} w_{jh}^{1} x_j\right)$$
• $f_1(\cdot)$ - activation function
• $z_h$ - activations at the $h^{th}$ node
[Diagram: two-layer network with inputs $x_0 = 1, x_1, \dots, x_d$; hidden units $z_0 = 1, z_1, \dots, z_h, \dots, z_H$; outputs $o_1, \dots, o_k, \dots, o_K$; first-layer weights $w_{jh}^{1}$ and second-layer weights $w_{hk}^{2}$.]
Basic Feedforward Network Model (3)
• Connection weights between hidden and output layer – $w_{hk}^{2}$
• Output of the $k^{th}$ output layer node - $o_k$
$$o_k = f_2\left(\sum_{j=0}^{H} w_{jk}^{2} z_j\right)$$
• $f_2(\cdot)$ - activation of the output layer.
  • Identity for a regression task
[Diagram: the same two-layer network, with inputs $x_0 = 1, \dots, x_d$, hidden units $z_0 = 1, \dots, z_H$ and outputs $o_1, \dots, o_K$.]
Example: 2-Layer Feedforward Network
$$o_k = f_2\left(\sum_{j=0}^{H} w_{jk}^{2}\, f_1\left(\sum_{i=0}^{d} w_{ij}^{1} x_i\right)\right)$$
[Diagram: the same two-layer network, with inputs $x_0 = 1, \dots, x_d$, hidden units $z_0 = 1, \dots, z_H$ and outputs $o_1, \dots, o_K$.]
Linear Feedforward Network
• If the activations of all the hidden units in a network are taken to be linear,
  • $f_h(z) = a_h z + b_h$
• There is always an equivalent network without the hidden units
  • Rather, the hidden layers can be collapsed into a single linear layer
  • Composition of successive linear transformations is itself a linear transformation
  • However, if the number of hidden units is less than the number of input or output units, then there is some loss of information
• Linear Feedforward Networks are of little interest to us
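The collapse of linear hidden layers can be checked numerically. The sketch below assumes the simplest linear activation ($a_h = 1$, $b_h = 0$) and uses made-up weight matrices; the helper names `matvec` and `matmul` are illustrative.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights: 3 inputs -> 2 linear hidden units -> 3 outputs.
W1 = [[1.0, 2.0, 0.0],
      [0.0, -1.0, 3.0]]
W2 = [[2.0, 1.0],
      [0.5, -1.0],
      [1.0, 1.0]]
x = [1.0, -2.0, 0.5]

two_layer = matvec(W2, matvec(W1, x))   # pass through the hidden layer
W = matmul(W2, W1)                      # single equivalent 3 x 3 matrix
one_layer = matvec(W, x)                # same mapping without hidden units
```

Because the hidden layer here has only two units, the collapsed 3 x 3 matrix has rank at most two, which is the loss of information mentioned above.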
Activation Functions - Sigmoid
• $f(x) = \dfrac{1}{1 + \exp(-x)}$
• Large negative and positive inputs become 0 and 1 respectively.
• Advantage
  • Continuous and differentiable
  • $f'(x) = f(x)\,(1 - f(x))$
Activation Functions – Sigmoid Drawbacks
• Saturation – a problem during the learning/training phase
  • Gradients at the tails are close to 0
• Outputs are not zero centered
  • Are always positive
  • Has issues during gradient update (all positive or negative updates)
Activation Functions - tanh
• $f(x) = 2\sigma(2x) - 1$
• Large negative and positive inputs become -1 and 1 respectively.
• Advantage
  • Continuous and differentiable
  • $f'(x) = 1 - f^2(x)$
  • Zero-centered output
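The sigmoid and tanh formulas can be sanity-checked with a few lines of code. This is a minimal sketch; the function names and the central-difference helper `numeric_grad` are illustrative additions, not part of the slides.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # f'(x) = f(x)(1 - f(x))

def tanh_via_sigmoid(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0   # tanh(x) = 2*sigmoid(2x) - 1

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t              # f'(x) = 1 - f(x)^2

def numeric_grad(f, x, h=1e-6):
    """Central-difference estimate, used only to check the closed forms."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Saturation: far out in the tail the sigmoid gradient is nearly zero.
tail_gradient = sigmoid_grad(20.0)
```

The tail gradient illustrates the saturation drawback: almost no learning signal passes through a saturated sigmoid unit.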
Activation Functions - ReLU
• Rectified Linear Unit
• $f(x) = \max(0, x)$
• Activation is simply thresholded at 0.
• Advantage
  • Computationally more efficient
  • Non-saturation
  • Sparse activation
Activation Functions - ReLU
• Disadvantages
  • Non-zero centered
  • Unbounded
  • Dying ReLU – units pushed into states in which they become inactive for all inputs!
    • Observed when the learning rate is too high
  • Not differentiable at 0, instead use subgradients
Diversion: Subgradients (1)
• A subgradient of a function $f: \mathbb{R}^d \to \mathbb{R}$ at a point $x$ is any value (vector) $g \in \mathbb{R}^d$ such that
$$\forall y \in D(f): \quad f(y) \ge f(x) + g^T (y - x)$$
• It generalizes the gradient to functions that are not differentiable everywhere
• If $f$ is differentiable then there is a unique subgradient $\forall x \in D(f),\ g = \nabla f(x)$
• Example
Diversion: Subgradients (2)
[Plot: ReLU as a function of $z$, annotated with $f'(z) = 0$ on the negative side and $f'(z) = 1$ on the positive side.]
• Can use any subgradient
  • At the differentiable points on the curve, this is the same as the gradient
  • Typically, will use the equation given
$$f'(z) = \begin{cases} 0, & z < 0 \\ 1, & z \ge 0 \end{cases}$$
Activation Functions – Leaky ReLU
• $f(x) = \mathbb{I}(x < 0)\,\alpha x + \mathbb{I}(x \ge 0)\,x$
  • $\alpha$ is a small constant
• Advantage
  • Reduces the dying unit problem
• Results are not consistent
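The ReLU subgradient convention and the Leaky ReLU definition can be sketched as follows. The default `alpha=0.01` and the helper `is_subgradient`, which tests the subgradient inequality on a finite set of points, are assumptions made for this example.

```python
def relu(x):
    return max(0.0, x)

def relu_subgrad(x):
    """Convention from the slides: 0 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else 0.0

def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x >= 0 else alpha

def is_subgradient(f, g, x, test_points):
    """Check f(y) >= f(x) + g*(y - x) on a finite set of points y."""
    return all(f(y) >= f(x) + g * (y - x) - 1e-12 for y in test_points)

points = [-2.0, -1.0, -0.1, 0.0, 0.1, 1.0, 2.0]
# At x = 0 every slope in [0, 1] satisfies the inequality; 1.5 does not.
valid = [is_subgradient(relu, g, 0.0, points) for g in (0.0, 0.5, 1.0)]
invalid = is_subgradient(relu, 1.5, 0.0, points)
```

Unlike ReLU, the Leaky ReLU gradient is never exactly zero for negative inputs, which is why it reduces the dying unit problem.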
[Figure 1 plots: ReLU, with $f(y) = y$ for positive $y$ and $f(y) = 0$ otherwise; PReLU, with $f(y) = y$ for positive $y$ and $f(y) = ay$ otherwise.]
Figure 1. ReLU vs. PReLU. For PReLU, the coefficient of the negative part is not constant and is adaptively learned.
2. Approach
In this section, we first present the PReLU activation function (Sec. 2.1). Then we derive our initialization method for deep rectifier networks (Sec. 2.2). Lastly we discuss our architecture designs (Sec. 2.3).
2.1. Parametric Rectifiers
We show that replacing the parameter-free ReLU activation by a learned parametric activation unit improves classification accuracy [footnote 1].
Definition
Formally, we consider an activation function defined as:
$$f(y_i) = \begin{cases} y_i, & \text{if } y_i > 0 \\ a_i y_i, & \text{if } y_i \le 0 \end{cases} \qquad (1)$$
Here $y_i$ is the input of the nonlinear activation $f$ on the $i$th channel, and $a_i$ is a coefficient controlling the slope of the negative part. The subscript $i$ in $a_i$ indicates that we allow the nonlinear activation to vary on different channels. When $a_i = 0$, it becomes ReLU; when $a_i$ is a learnable parameter, we refer to Eqn.(1) as Parametric ReLU (PReLU). Figure 1 shows the shapes of ReLU and PReLU. Eqn.(1) is equivalent to $f(y_i) = \max(0, y_i) + a_i \min(0, y_i)$.
If $a_i$ is a small and fixed value, PReLU becomes the Leaky ReLU (LReLU) in [20] ($a_i = 0.01$). The motivation of LReLU is to avoid zero gradients. Experiments in [20] show that LReLU has negligible impact on accuracy compared with ReLU. On the contrary, our method adaptively learns the PReLU parameters jointly with the whole model. We hope for end-to-end training that will lead to more specialized activations.
PReLU introduces a very small number of extra parameters. The number of extra parameters is equal to the total number of channels, which is negligible when considering the total number of weights. So we expect no extra risk of overfitting. We also consider a channel-shared variant: $f(y_i) = \max(0, y_i) + a \min(0, y_i)$ where the coefficient is shared by all channels of one layer. This variant only introduces a single extra parameter into each layer.
[Footnote 1] Concurrent with our work, Agostinelli et al. [1] also investigated learning activation functions and showed improvement on other tasks.
Optimization
PReLU can be trained using backpropagation [17] and optimized simultaneously with other layers. The update formulations of $\{a_i\}$ are simply derived from the chain rule. The gradient of $a_i$ for one layer is:
$$\frac{\partial E}{\partial a_i} = \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a_i}, \qquad (2)$$
where $E$ represents the objective function. The term $\frac{\partial E}{\partial f(y_i)}$ is the gradient propagated from the deeper layer. The gradient of the activation is given by:
$$\frac{\partial f(y_i)}{\partial a_i} = \begin{cases} 0, & \text{if } y_i > 0 \\ y_i, & \text{if } y_i \le 0 \end{cases} \qquad (3)$$
The summation $\sum_{y_i}$ runs over all positions of the feature map. For the channel-shared variant, the gradient of $a$ is $\frac{\partial E}{\partial a} = \sum_i \sum_{y_i} \frac{\partial E}{\partial f(y_i)} \frac{\partial f(y_i)}{\partial a}$, where $\sum_i$ sums over all channels of the layer. The time complexity due to PReLU is negligible for both forward and backward propagation.
We adopt the momentum method when updating $a_i$:
$$\Delta a_i := \mu \Delta a_i + \epsilon \frac{\partial E}{\partial a_i}. \qquad (4)$$
Here $\mu$ is the momentum and $\epsilon$ is the learning rate. It is worth noticing that we do not use weight decay ($l_2$ regularization) when updating $a_i$. A weight decay tends to push $a_i$ to zero, and thus biases PReLU toward ReLU. Even without regularization, the learned coefficients rarely have a magnitude larger than 1 in our experiments. Further, we do not constrain the range of $a_i$ so that the activation function may be non-monotonic. We use $a_i = 0.25$ as the initialization throughout this paper.
Comparison Experiments
We conducted comparisons on a deep but efficient model with 14 weight layers. The model was studied in [10] (model E of [10]) and its architecture is described in Table 1. We choose this model because it is sufficient for representing a category of very deep models, as well as to make the experiments feasible.
As a baseline, we train this model with ReLU applied in the convolutional (conv) layers and the first two fully-connected (fc) layers. The training implementation follows [10]. The top-1 and top-5 errors are 33.82% and 13.34% on ImageNet 2012, using 10-view testing (Table 2).
Activation Functions - Summary
• Never use Sigmoid, prefer Tanh over sigmoid
• Better choice would be ReLU or Leaky ReLU
Skip Connections
• Generic Feedforward networks can have skip connections
  • Connections that skip a layer
• In principle, networks with sigmoidal hidden units can always mimic skip connections
5. NEURAL NETWORKS (p. 230)
Figure 5.2 Example of a neural network having a general feed-forward topology. Note that each hidden and output unit has an associated bias parameter (omitted for clarity).
[Diagram: inputs $x_1, x_2$; hidden units $z_1, z_2, z_3$; outputs $y_1, y_2$.]
instance, in a two-layer network these would go directly from inputs to outputs. In principle, a network with sigmoidal hidden units can always mimic skip layer connections (for bounded input values) by using a sufficiently small first-layer weight that, over its operating range, the hidden unit is effectively linear, and then compensating with a large weight value from the hidden unit to the output. In practice, however, it may be advantageous to include skip-layer connections explicitly.
Furthermore, the network can be sparse, with not all possible connections within a layer being present. We shall see an example of a sparse network architecture when we consider convolutional neural networks in Section 5.5.6.
Because there is a direct correspondence between a network diagram and its mathematical function, we can develop more general network mappings by considering more complex network diagrams. However, these must be restricted to a feed-forward architecture, in other words to one having no closed directed cycles, to ensure that the outputs are deterministic functions of the inputs. This is illustrated with a simple example in Figure 5.2. Each (hidden or output) unit in such a network computes a function given by
$$z_k = h\left(\sum_j w_{kj} z_j\right) \qquad (5.10)$$
where the sum runs over all units that send connections to unit k (and a bias parameter is included in the summation). For a given set of values applied to the inputs of the network, successive application of (5.10) allows the activations of all units in the network to be evaluated including those of the output units.
The approximation properties of feed-forward networks have been widely studied (Funahashi, 1989; Cybenko, 1989; Hornik et al., 1989; Stinchecombe and White, 1989; Cotter, 1990; Ito, 1991; Hornik, 1991; Kreinovich, 1991; Ripley, 1996) and found to be very general. Neural networks are therefore said to be universal approximators. For example, a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy provided the network has a sufficiently large number of hidden units. This result holds for a wide range of hidden unit activation functions, but excluding polynomials. Although such theorems are reassuring, the key problem is how to find suitable parameter values given a set of training data, and in later sections of this chapter we
Weight Space Symmetries (1)
• Multiple distinct choices for the weight vector can all give rise to the same mapping function from the inputs to the outputs
[Diagrams: the two-layer network shown twice; in the second copy the hidden units $z_1$ and $z_h$ are interchanged together with their incoming and outgoing weights.]
Weight Space Symmetries (2)
• Multiple distinct choices for the weight vector can all give rise to the same mapping function from the inputs to the outputs
• Even though the mapping is the same, the weight vectors are different.
• Given $H$ hidden units, there are $H!$ different orderings
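The permutation symmetry can be demonstrated on a toy network. The sketch below assumes tanh hidden units, an identity output, no biases, and made-up weights; `forward` and `permute_hidden` are illustrative names.

```python
import math
from itertools import permutations

def forward(x, W1, W2):
    """Two-layer net: W1[h] holds the weights into hidden unit h,
    W2[k][h] is the weight from hidden unit h to output k."""
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * zh for w, zh in zip(row, z)) for row in W2]

def permute_hidden(W1, W2, perm):
    """Reorder hidden units, moving incoming and outgoing weights together."""
    W1p = [W1[h] for h in perm]
    W2p = [[row[h] for h in perm] for row in W2]
    return W1p, W2p

W1 = [[0.2, -0.4], [1.0, 0.5], [-0.7, 0.3]]   # H = 3 hidden units, 2 inputs
W2 = [[0.5, -1.0, 2.0]]                        # 1 output
x = [0.3, -1.2]

base = forward(x, W1, W2)
outputs = [forward(x, *permute_hidden(W1, W2, p)) for p in permutations(range(3))]
```

All $3! = 6$ orderings give different weight vectors but the same output, up to floating-point rounding.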
Machine Learning Model Paradigm
• Generic Architecture
  • Representation
    • How would you like to characterize what is being learned?
  • Evaluation
    • How would you like to measure the goodness of what is being learned?
  • Optimization
    • Given the evaluation and characterization, find the optimum representation.
Problem Statement
• Given a training set of input and output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• We have parameterized the function $f: \mathbb{R}^d \to \mathcal{Y}$ as the neural network.
  • Parameters are $w_{jh}^{l}$ - weights of the neural network
• The next step would be to define an evaluation function $\mathcal{L}$ in terms of the parameters of the network.
$$\mathcal{L}(\mathbf{W}) = \cdots$$
Representing the Input
• Vectors of numbers
  • (or may even be just a scalar, if input layer is of size 1)
  • E.g. vector of pixel values
  • E.g. vector of speech features
  • E.g. real-valued vector representing text
  • Other real valued vectors
[Diagram: input layer, hidden layers, output layer.]
Representing the Output (Regression)
• If the desired output is real-valued
  • Scalar Output: single output neuron - $o$ without any activation
  • Vector Output: as many output neurons as the dimension of the desired output – $\mathbf{o} = [o_1, o_2, \dots, o_K]$
Representing the Output (Classification) (1)
• If the desired output is binary (0/1 or -1/+1)
  • Sigmoid Output: single output neuron - $o$ with sigmoid activation
  • Viewed as the probability $P(Y = 1 \mid X)$ of class value 1
    • Indicating the fact that for actual data, in general a feature value $\mathbf{x}$ may occur in both classes, but with different probabilities
    • Is differentiable
Representing the Output (Classification) (2)
• Consider a network that must distinguish the input as one of the following classes – dog, cat, cow, or horse
• We can represent this set as the following vector: $[\text{dog}\ \ \text{cat}\ \ \text{cow}\ \ \text{horse}]^T$
• For inputs of each of the four classes the desired output is:
  • dog: $[1\ 0\ 0\ 0]^T$, cat: $[0\ 1\ 0\ 0]^T$, cow: $[0\ 0\ 1\ 0]^T$, horse: $[0\ 0\ 0\ 1]^T$
• For an input of any class, we will have a four-dimensional vector output with three zeros and a single 1 at the position of that class
  • This is a one hot vector representation
Representing the Output (Classification) (3)
• For a multi-class classifier with $K$ classes, the one-hot representation will have $K$ binary outputs
• The neural network’s output too must ideally be binary ($K - 1$ zeros and a single 1 in the right place)
• More realistically, it will be a probability vector of $K$ dimensions
  • $K$ probability values that sum to 1.
Representing the Output (Classification) (4)
• Softmax vector activation is often used at the output of multi-class classifier nets
$$z_k^{L} = \mathbf{w}_k^{L} \cdot \mathbf{z}^{L-1}, \qquad \hat{y}_k = \frac{\exp\left(z_k^{L}\right)}{\sum_j \exp\left(z_j^{L}\right)}$$
• This can be viewed as the probability $\hat{y}_k = P(\text{class} = k \mid \mathbf{x})$
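A minimal softmax sketch follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not on the slide; it leaves the result unchanged because the common factor cancels in the ratio. The input scores are made up.

```python
import math

def softmax(z):
    """Map a list of scores z to a probability vector."""
    m = max(z)                               # stability shift (assumption, not on the slide)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1, -1.0])       # e.g. scores for dog, cat, cow, horse
predicted_class = probs.index(max(probs))    # index of the most probable class
```

The outputs are all positive and sum to 1, so they can be read as class probabilities.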
Representing the Output (Classification) (5)
• Consider a network that must identify all/some of the objects present in an image
• We can represent this set as the following vector: $[\text{dog}\ \ \text{cat}\ \ \text{cow}\ \ \text{horse}]^T$
• Each class is however treated independently (no longer a one-hot encoding)
  • Output activations will be individual Sigmoids (not softmax)
Defining the Evaluation Function
• Given a training set of input and output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• The feedforward pass of the network provides the outputs $\{\hat{y}_i\}_{i=1}^{N}$
• We can now define the evaluation function $\mathcal{L}(\mathbf{W})$
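Evaluation Functions - $L_2$ divergence
• For real-valued output vectors, the $L_2$ divergence is popular
$$\ell(\mathbf{W}; \mathbf{x}, \mathbf{y}) = \frac{1}{2}\lVert \hat{\mathbf{y}} - \mathbf{y} \rVert^2 = \frac{1}{2}\sum_{j=1}^{K} (\hat{y}_j - y_j)^2$$
• Squared Euclidean distance between actual and desired output
• Note: this is differentiable
$$\frac{d\ell(\mathbf{W}; \mathbf{x}, \mathbf{y})}{d\hat{y}_j} = \hat{y}_j - y_j$$
$$\nabla_{\hat{\mathbf{y}}}\, \ell(\mathbf{W}; \mathbf{x}, \mathbf{y}) = [\hat{y}_1 - y_1,\ \hat{y}_2 - y_2,\ \dots,\ \hat{y}_K - y_K]^T$$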
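The divergence and its gradient translate directly into code. The sketch below uses made-up output and target vectors; the function names are illustrative.

```python
def l2_divergence(y_hat, y):
    """0.5 * squared Euclidean distance between actual and desired output."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y_hat, y))

def l2_gradient(y_hat, y):
    """Gradient with respect to y_hat: component j is y_hat_j - y_j."""
    return [a - b for a, b in zip(y_hat, y)]

y_hat = [1.0, 2.0]    # hypothetical network output
y = [0.0, 4.0]        # hypothetical desired output
loss = l2_divergence(y_hat, y)
grad = l2_gradient(y_hat, y)
```

The factor of one half is what makes the gradient come out as a plain difference with no extra constant.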
Evaluation Functions – Cross Entropy Error
• For a binary classifier, the desired output $y \in \{0, 1\}$
  • This can be interpreted as the true probability distribution $[y,\ 1 - y]$
• The predicted probability distribution is $[\hat{y},\ 1 - \hat{y}]$
• Then, we can estimate the Kullback-Leibler (KL) divergence between the two distributions
$$\ell(\mathbf{W}; \mathbf{x}, y) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$$
  • Minimum occurs when $\hat{y} = y$
• Derivative:
$$\frac{d\ell(\mathbf{W}; \mathbf{x}, y)}{d\hat{y}} = \begin{cases} -\dfrac{1}{\hat{y}} & \text{if } y = 1 \\[2mm] \dfrac{1}{1 - \hat{y}} & \text{if } y = 0 \end{cases}$$
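The piecewise derivative can be compared with a finite-difference estimate. This is a small sketch with an arbitrary prediction of 0.3; the function names are illustrative.

```python
import math

def binary_cross_entropy(y_hat, y):
    """-y*log(y_hat) - (1 - y)*log(1 - y_hat) for y in {0, 1} and 0 < y_hat < 1."""
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def bce_derivative(y_hat, y):
    """Derivative with respect to y_hat, split by the value of the label."""
    return -1.0 / y_hat if y == 1 else 1.0 / (1.0 - y_hat)

def numeric_derivative(y_hat, y, h=1e-6):
    """Central difference, used only to check the closed form."""
    return (binary_cross_entropy(y_hat + h, y)
            - binary_cross_entropy(y_hat - h, y)) / (2 * h)

d_pos = bce_derivative(0.3, 1)   # negative: increasing y_hat lowers the loss
d_neg = bce_derivative(0.3, 0)   # positive: increasing y_hat raises the loss
```

The sign of the derivative always pushes the prediction toward the label.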
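Evaluation Functions – Softmax Loss
• Desired output $\mathbf{y}$ is a one hot vector $[0\ 0 \dots 1 \dots 0\ 0\ 0]$ with the 1 in the position corresponding to the true class
• Actual output will be a probability distribution $\hat{\mathbf{y}}$
• The KL divergence between the desired one-hot output and actual output:
$$\ell(\mathbf{W}; \mathbf{x}, \mathbf{y}) = -\sum_{j} y_j \log \hat{y}_j = -\log \hat{y}_k, \quad \text{where } k \text{ is the true class}$$
• Derivative:
$$\frac{d\ell(\mathbf{W}; \mathbf{x}, \mathbf{y})}{d\hat{y}_j} = \begin{cases} -\dfrac{1}{\hat{y}_k} & \text{for the } k^{th} \text{ component} \\[2mm] 0 & \text{for the remaining components} \end{cases}$$
$$\nabla_{\hat{\mathbf{y}}}\, \ell(\mathbf{W}; \mathbf{x}, \mathbf{y}) = \left[0\ \ 0\ \dots\ -\frac{1}{\hat{y}_k}\ \dots\ 0\ \ 0\right]$$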
So Far
• Representation
  • How would you like to characterize what is being learned? – $\mathbf{W}$ – neural network parameters
• Evaluation
  • How would you like to measure the goodness of what is being learned? – $\mathcal{L}(\mathbf{W})$ – evaluation functions
• Optimization
  • Given the evaluation and characterization, find the optimum representation
  • Solution – Gradient Descent!
Training Feedforward Networks Through Gradient Descent
• Given a training set of input and output pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$
• We have parameterized the function $f: \mathbb{R}^d \to \mathcal{Y}$ as the neural network.
  • Parameters are $w_{jh}^{l}$ - weights of the neural network
• An evaluation function $\mathcal{L}$ in terms of the parameters of the network to measure the goodness of the function
  • Minimize $\mathcal{L}(\mathbf{W})$ with respect to $w_{jh}^{l}$
Quick Recap – Gradient Descent
• In order to minimize any function $f(\mathbf{x})$ w.r.t. $\mathbf{x}$
• Initialize:
  • $\mathbf{x}^{0}$
  • $t = 0$
• While $|f(\mathbf{x}^{t+1}) - f(\mathbf{x}^{t})| > \varepsilon$
  • $\mathbf{x}^{t+1} = \mathbf{x}^{t} - \eta^{t}\, \nabla f(\mathbf{x}^{t})$
  • $t = t + 1$
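The loop above can be sketched in a few lines. This version assumes a constant step size `eta` rather than a schedule $\eta^t$, adds a `max_iter` safeguard, and minimizes a made-up quadratic whose minimum is at $(3, -1)$.

```python
def gradient_descent(f, grad, x0, eta=0.1, eps=1e-12, max_iter=10_000):
    """Iterate x <- x - eta * grad(x) until f changes by at most eps."""
    x = list(x0)
    for t in range(max_iter):
        g = grad(x)
        x_next = [xi - eta * gi for xi, gi in zip(x, g)]
        if abs(f(x_next) - f(x)) <= eps:     # stopping rule from the recap
            return x_next, t + 1
        x = x_next
    return x, max_iter

# Hypothetical objective with a known minimum at (3, -1).
f = lambda x: (x[0] - 3) ** 2 + 2 * (x[1] + 1) ** 2
grad = lambda x: [2 * (x[0] - 3), 4 * (x[1] + 1)]

x_min, steps = gradient_descent(f, grad, [0.0, 0.0])
```

With too large a step size the iterates would overshoot and diverge, so the value of `eta` matters in practice.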
Training Feedforward Networks Through Gradient Descent
• Total Training Error - $\mathcal{L}(\mathbf{W}) = \frac{1}{N}\sum_{i=1}^{N} \ell(\mathbf{W}; \mathbf{x}_i, y_i)$
• Initialize all weights – $w_{jh}^{l}$
• Do:
  • For every layer $l$, for all $j, h$ update:
$$w_{jh}^{l} \leftarrow w_{jh}^{l} - \eta\, \frac{d\mathcal{L}(\mathbf{W})}{dw_{jh}^{l}}$$
• Until $\mathcal{L}(\mathbf{W})$ has converged
• So how to compute $\dfrac{d\mathcal{L}(\mathbf{W})}{dw_{jh}^{l}}$?
Key Principle – Chain Rule of Differentiation
$$y = f\left(g_1(x), g_2(x), \dots, g_M(x)\right)$$
• Then,
$$\frac{dy}{dx} = \frac{\partial f}{\partial g_1(x)}\frac{dg_1(x)}{dx} + \frac{\partial f}{\partial g_2(x)}\frac{dg_2(x)}{dx} + \dots + \frac{\partial f}{\partial g_M(x)}\frac{dg_M(x)}{dx}$$
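A small worked instance of the chain rule, under the assumption $f(g_1, g_2) = g_1 g_2$ with $g_1(x) = \sin x$ and $g_2(x) = x^2$ (functions chosen only for illustration):

```python
import math

def y(x):
    g1, g2 = math.sin(x), x ** 2
    return g1 * g2                        # f(g1, g2) = g1 * g2

def dy_dx(x):
    g1, g2 = math.sin(x), x ** 2
    df_dg1, df_dg2 = g2, g1               # partial derivatives of f
    dg1_dx, dg2_dx = math.cos(x), 2 * x   # derivatives of the inner functions
    # Sum one term per intermediate function, as in the chain rule.
    return df_dg1 * dg1_dx + df_dg2 * dg2_dx

def numeric(x, h=1e-6):
    """Central difference, used only as a check."""
    return (y(x + h) - y(x - h)) / (2 * h)

analytic = dy_dx(1.3)
estimate = numeric(1.3)
```

Each term in the sum corresponds to one path from $x$ to $y$ through an intermediate function.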
Digression – Automatic Differentiation (1)
• Automatic Differentiation
  • No need for explicit form of the derivative
    • Difficult to compute it for extremely complex functions
  • Better than numerical differentiation
    • Numerical differentiation gives only an estimate, not an exact value
Digression – Automatic Differentiation (2): Computation Graph
Digression – Automatic Differentiation (3): Two Modes
Digression – Automatic Differentiation (4): Reverse Accumulation
Digression – Automatic Differentiation (5): Reverse Accumulation
Digression – Automatic Differentiation (6): Reverse Accumulation
Digression – Automatic Differentiation (7): Reverse Accumulation
Digression – Automatic Differentiation (8): Reverse Accumulation
• Build a computation graph
• Forward: Evaluate the graph and store the intermediate results
• Backward: Evaluate the graph in a reversed order
  • Eliminate paths not needed
• Computational Complexity: Similar to the forward pass
• Memory Complexity: High – as the intermediate results need to be stored.
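The steps above can be sketched as a toy reverse-mode differentiator. The `Var` class, the supported operations (addition, multiplication, sine) and the example function are all assumptions made for illustration; this is not the implementation of any particular library.

```python
import math

class Var:
    """Graph node: stores its value, its parents, and local partial derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents        # tuples of (parent node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Visit the graph in reverse topological order, accumulating gradients."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)            # parents are appended before their children
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

# Forward pass builds the graph and stores the intermediate values.
x1, x2 = Var(2.0), Var(3.0)
y = x1 * x2 + sin(x1)
backward(y)                           # fills x1.grad and x2.grad
```

A single backward sweep yields the derivative of the output with respect to every input, at the price of keeping all intermediate values in memory.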
Applying Forward and Backward Pass to a Feedforward Network
[Diagram: an $L$-layer network with input $\mathbf{y}^0 = \mathbf{x}$; layer $l$ computes pre-activations $\mathbf{z}^l$ and outputs $\mathbf{y}^l$ through the activation $f_l$, with a constant bias input 1 at each layer; the final output is $\mathbf{y}^L = \hat{\mathbf{y}}$.]
Generic Forward Pass (1)
[Diagram: the $L$-layer network, starting from the input $\mathbf{y}^0 = \mathbf{x}$.]
Generic Forward Pass (2)
$$z_j^{1} = \sum_i w_{ij}^{1}\, y_i^{0}$$
Generic Forward Pass (3)
$$z_j^{1} = \sum_i w_{ij}^{1}\, y_i^{0}, \qquad y_j^{1} = f_1\left(z_j^{1}\right)$$
Generic Forward Pass (4)
$$z_j^{1} = \sum_i w_{ij}^{1}\, y_i^{0}, \qquad y_j^{1} = f_1\left(z_j^{1}\right), \qquad z_j^{2} = \sum_i w_{ij}^{2}\, y_i^{1}$$
Generic Forward Pass (5)
$$z_j^{1} = \sum_i w_{ij}^{1}\, y_i^{0}, \qquad y_j^{1} = f_1\left(z_j^{1}\right), \qquad z_j^{2} = \sum_i w_{ij}^{2}\, y_i^{1}, \qquad y_j^{2} = f_2\left(z_j^{2}\right)$$
Generic Forward Pass (6)
$$z_j^{1} = \sum_i w_{ij}^{1}\, y_i^{0}, \qquad y_j^{1} = f_1\left(z_j^{1}\right), \qquad z_j^{2} = \sum_i w_{ij}^{2}\, y_i^{1}, \qquad y_j^{2} = f_2\left(z_j^{2}\right), \qquad z_j^{3} = \sum_i w_{ij}^{3}\, y_i^{2}$$
Generic Forward Pass (7)
$$z_j^{1} = \sum_i w_{ij}^{1}\, y_i^{0}, \qquad y_j^{1} = f_1\left(z_j^{1}\right), \qquad z_j^{2} = \sum_i w_{ij}^{2}\, y_i^{1}, \qquad y_j^{2} = f_2\left(z_j^{2}\right), \qquad z_j^{3} = \sum_i w_{ij}^{3}\, y_i^{2}, \qquad y_j^{3} = f_3\left(z_j^{3}\right), \quad \cdots$$
Generic Forward Pass (8)
$$y_j^{L-1} = f_{L-1}\left(z_j^{L-1}\right), \qquad z_j^{L} = \sum_i w_{ij}^{L}\, y_i^{L-1}, \qquad y_j^{L} = f_L\left(z_j^{L}\right)$$
Generic Forward Pass (9)
$$\mathbf{y}^{0} = \mathbf{x}, \qquad y_j^{l-1} = f_{l-1}\left(z_j^{l-1}\right), \qquad z_j^{l} = \sum_i w_{ij}^{l}\, y_i^{l-1}$$
Backward Pass – Computing Derivatives (1)
• All the intermediate values estimated in the forward pass must be stored – we will need them to compute the derivatives
Backward Pass – Computing Derivatives (2)
[Diagram: the network output $\mathbf{y}^L = \hat{\mathbf{y}}$ and the target $\mathbf{y}$ feed into the loss $\mathcal{L}(W)$.]
Backward Pass – Computing Derivatives (3)
$$\frac{\partial \mathcal{L}(W)}{\partial \hat{y}_i} = \frac{\partial \mathcal{L}(W)}{\partial y_i^{L}}$$
Backward Pass – Computing Derivatives (4)
$$\frac{\partial \mathcal{L}(W)}{\partial z_1^{L}} = \frac{\partial \mathcal{L}(W)}{\partial y_1^{L}}\, \frac{\partial y_1^{L}}{\partial z_1^{L}}$$
• $\partial \mathcal{L}(W) / \partial y_1^{L}$: already computed
Backward Pass – Computing Derivatives (5)
$$\frac{\partial \mathcal{L}(W)}{\partial z_1^{L}} = \frac{\partial \mathcal{L}(W)}{\partial y_1^{L}}\, \frac{\partial y_1^{L}}{\partial z_1^{L}}$$
• $\partial y_1^{L} / \partial z_1^{L}$: derivative of the activation function $f_L\left(z_1^{L}\right)$
Backward Pass – Computing Derivatives (6)
$$\frac{\partial \mathcal{L}(W)}{\partial z_i^{L}} = \frac{\partial \mathcal{L}(W)}{\partial y_i^{L}}\, \frac{\partial y_i^{L}}{\partial z_i^{L}}$$
Backward Pass – Computing Derivatives (7)
$$\frac{\partial \mathcal{L}(W)}{\partial w_{11}^{L}} = \frac{\partial \mathcal{L}(W)}{\partial z_1^{L}}\, \frac{\partial z_1^{L}}{\partial w_{11}^{L}}$$
• $\partial \mathcal{L}(W) / \partial z_1^{L}$: computed in the previous step
Backward Pass – Computing Derivatives (8)
$$\frac{\partial \mathcal{L}(W)}{\partial w_{11}^{L}} = \frac{\partial \mathcal{L}(W)}{\partial z_1^{L}}\, \frac{\partial z_1^{L}}{\partial w_{11}^{L}}$$
$$z_1^{L} = w_{11}^{L}\, y_1^{L-1} + \cdots \quad\Rightarrow\quad \frac{\partial z_1^{L}}{\partial w_{11}^{L}} = y_1^{L-1}$$
Backward Pass – Computing Derivatives (9)
$$\frac{\partial \mathcal{L}(W)}{\partial w_{ij}^{L}} = \frac{\partial \mathcal{L}(W)}{\partial z_j^{L}}\, \frac{\partial z_j^{L}}{\partial w_{ij}^{L}} = y_i^{L-1}\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{L}}$$
Backward Pass – Computing Derivatives (10)
$$\frac{\partial \mathcal{L}(W)}{\partial y_1^{L-1}} = \sum_j \frac{\partial \mathcal{L}(W)}{\partial z_j^{L}}\, \frac{\partial z_j^{L}}{\partial y_1^{L-1}}$$
• $\partial \mathcal{L}(W) / \partial z_j^{L}$: computed a couple of steps earlier
Backward Pass – Computing Derivatives (11)
$$\frac{\partial \mathcal{L}(W)}{\partial y_1^{L-1}} = \sum_j \frac{\partial \mathcal{L}(W)}{\partial z_j^{L}}\, \frac{\partial z_j^{L}}{\partial y_1^{L-1}}$$
$$z_j^{L} = w_{1j}^{L}\, y_1^{L-1} + \cdots \quad\Rightarrow\quad \frac{\partial z_j^{L}}{\partial y_1^{L-1}} = w_{1j}^{L}$$
Backward Pass – Computing Derivatives (12)
$$\frac{\partial \mathcal{L}(W)}{\partial y_1^{L-1}} = \sum_j w_{1j}^{L}\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{L}}$$
Backward Pass – Computing Derivatives (13)
$$\frac{\partial \mathcal{L}(W)}{\partial y_i^{L-1}} = \sum_j w_{ij}^{L}\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{L}}$$
Backward Pass – Computing Derivatives (14)
$$\frac{\partial \mathcal{L}(W)}{\partial z_i^{L-1}} = \frac{\partial y_i^{L-1}}{\partial z_i^{L-1}}\, \frac{\partial \mathcal{L}(W)}{\partial y_i^{L-1}}$$
Backward Pass – Computing Derivatives (15)
$$\frac{\partial \mathcal{L}(W)}{\partial y_i^{l}} = \sum_j w_{ij}^{l+1}\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{l+1}}$$
Backward Pass – Computing Derivatives (16)
$$\frac{\partial \mathcal{L}(W)}{\partial z_i^{l}} = \frac{\partial y_i^{l}}{\partial z_i^{l}}\, \frac{\partial \mathcal{L}(W)}{\partial y_i^{l}}$$
Backward Pass – Computing Derivatives (17)
$$\frac{\partial \mathcal{L}(W)}{\partial y_i^{1}} = \sum_j w_{ij}^{2}\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{2}}$$
Backward Pass – Computing Derivatives (18)
$$\frac{\partial \mathcal{L}(W)}{\partial z_i^{1}} = \frac{\partial y_i^{1}}{\partial z_i^{1}}\, \frac{\partial \mathcal{L}(W)}{\partial y_i^{1}}$$
Backward Pass – Computing Derivatives (19)
$$\frac{\partial \mathcal{L}(W)}{\partial w_{ij}^{1}} = \frac{\partial z_j^{1}}{\partial w_{ij}^{1}}\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{1}} = y_i^{0}\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{1}} = x_i\, \frac{\partial \mathcal{L}(W)}{\partial z_j^{1}}$$
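Backward Pass – Summary
• Compute $\mathcal{L}(\mathbf{W})$
• Initialize the gradient with respect to the network output
$$\frac{\partial \mathcal{L}(\mathbf{W})}{\partial \hat{y}_i} = \frac{\partial \mathcal{L}(\mathbf{W})}{\partial y_i^{L}}, \qquad \frac{\partial \mathcal{L}(\mathbf{W})}{\partial z_i^{L}} = \frac{\partial y_i^{L}}{\partial z_i^{L}}\, \frac{\partial \mathcal{L}(\mathbf{W})}{\partial y_i^{L}}$$
• For $l = L - 1, \dots, 0$
  • For $i = 1 :$ width of the $l^{th}$ layer
$$\frac{\partial \mathcal{L}(\mathbf{W})}{\partial y_i^{l}} = \sum_j w_{ij}^{l+1}\, \frac{\partial \mathcal{L}(\mathbf{W})}{\partial z_j^{l+1}}$$
$$\frac{\partial \mathcal{L}(\mathbf{W})}{\partial z_i^{l}} = \frac{\partial y_i^{l}}{\partial z_i^{l}}\, \frac{\partial \mathcal{L}(\mathbf{W})}{\partial y_i^{l}}$$
$$\forall j: \quad \frac{\partial \mathcal{L}(\mathbf{W})}{\partial w_{ij}^{l+1}} = y_i^{l}\, \frac{\partial \mathcal{L}(\mathbf{W})}{\partial z_j^{l+1}}$$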
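Special Cases – Vector Activations
• The argument to the activation function is a vector (instead of scalar)
  • Each $z_i^{l}$ influences all $y_1^{l}, \dots, y_H^{l}$
• The number of outputs for vector activations need not be the same as the number of inputs
• Thus
$$\frac{\partial \mathcal{L}}{\partial z_i^{l}} = \sum_j \frac{\partial \mathcal{L}}{\partial y_j^{l}}\, \frac{\partial y_j^{l}}{\partial z_i^{l}}$$
• Exercise – Work out for softmax
[Diagram: layer output $\mathbf{y}^{l-1}$ feeding $\mathbf{z}^{l}$, which passes through a vector activation to give $\mathbf{y}^{l}$.]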
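One way to explore the softmax exercise numerically is sketched below. It relies on the standard softmax Jacobian $\partial y_j / \partial z_i = y_j(\delta_{ij} - y_i)$, which is stated here as a known result rather than taken from the slides, and on made-up scores and a one-hot target.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def softmax_jacobian(z):
    """J[j][i] = dy_j/dz_i = y_j * (delta_ij - y_i)."""
    y = softmax(z)
    n = len(z)
    return [[y[j] * ((1.0 if i == j else 0.0) - y[i]) for i in range(n)]
            for j in range(n)]

def grad_wrt_z(z, grad_y):
    """dL/dz_i = sum_j dL/dy_j * dy_j/dz_i (every z_i influences every y_j)."""
    J = softmax_jacobian(z)
    n = len(z)
    return [sum(grad_y[j] * J[j][i] for j in range(n)) for i in range(n)]

z = [1.0, 2.0, 0.5]            # hypothetical pre-activations
target = [0.0, 1.0, 0.0]       # one-hot desired output
y_hat = softmax(z)
grad_y = [-t / p for t, p in zip(target, y_hat)]   # gradient of -sum_k y_k log(y_hat_k)
grad_z = grad_wrt_z(z, grad_y)
```

For this loss the summed result simplifies to $\hat{y}_i - y_i$, which the code reproduces.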
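Special Case – Multiplicative Units
• Some networks have multiplicative combination
  • In contrast to additive combination
$$o_i^{l} = y_j^{l-1}\, y_k^{l-1}$$
• Backward Pass – gradient computation
$$\frac{\partial \mathcal{L}}{\partial y_j^{l-1}} = \frac{\partial o_i^{l}}{\partial y_j^{l-1}}\, \frac{\partial \mathcal{L}}{\partial o_i^{l}} = y_k^{l-1}\, \frac{\partial \mathcal{L}}{\partial o_i^{l}}$$
[Diagram: $\mathbf{z}^{l-1}$ and $\mathbf{y}^{l-1}$ feeding a multiplicative unit $\otimes$ that produces $\mathbf{o}^{l}$, followed by weights $\mathbf{w}^{l}$.]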
Converting the Process Using Vector Operations
• For layered networks it is generally simpler to think of the process in terms of vector operations
  • Simpler arithmetic
  • Fast matrix libraries make operations much faster
• We can restate the entire process in vector terms
  • On slides, please read
  • This is what is actually used in any real system
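Vector Form (1)
$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}, \qquad \mathbf{z}^{l} = \begin{bmatrix} z_1^{l} \\ z_2^{l} \\ \vdots \\ z_{d_l}^{l} \end{bmatrix}, \qquad \mathbf{y}^{l} = \begin{bmatrix} y_1^{l} \\ y_2^{l} \\ \vdots \\ y_{d_l}^{l} \end{bmatrix}$$
$$\mathbf{W}^{l} = \begin{bmatrix} w_{11}^{l} & w_{21}^{l} & \cdots & w_{d_{l-1}1}^{l} \\ w_{12}^{l} & w_{22}^{l} & \cdots & w_{d_{l-1}2}^{l} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1 d_l}^{l} & w_{2 d_l}^{l} & \cdots & w_{d_{l-1} d_l}^{l} \end{bmatrix}$$
$$\mathbf{z}^{l} = \mathbf{W}^{l} \mathbf{y}^{l-1}, \qquad \mathbf{y}^{l} = f_l\left(\mathbf{z}^{l}\right)$$
[Diagram: first layer with inputs $x_1, \dots, x_d$, weights $w_{11}^{1}, \dots, w_{d d_1}^{1}$, pre-activations $z_1^{1}, \dots, z_{d_1}^{1}$ and outputs $y_1^{1}, \dots, y_{d_1}^{1}$.]
Forward Pass:
$$\hat{\mathbf{y}} = \mathbf{y}^{L} = f_L\left(\mathbf{W}^{L} f_{L-1}\left(\mathbf{W}^{L-1} f_{L-2}\left(\dots \mathbf{W}^{2} f_1\left(\mathbf{W}^{1}\mathbf{x}\right)\right)\right)\right)$$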
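Vector Form (2) – The Jacobian
• The derivative of a vector function w.r.t. vector input is called a Jacobian
• It is the matrix of partial derivatives
• $\mathbf{z}^{l} = \mathbf{W}^{l}\mathbf{y}^{l-1}$ implies $J_{\mathbf{z}^{l}}\left(\mathbf{y}^{l-1}\right) = \mathbf{W}^{l}$
$$\mathbf{y}^{l} = \begin{bmatrix} y_1^{l} \\ y_2^{l} \\ \vdots \\ y_{d_l}^{l} \end{bmatrix} = f_l\left(\begin{bmatrix} z_1^{l} \\ z_2^{l} \\ \vdots \\ z_{d_l}^{l} \end{bmatrix}\right) = f_l\left(\mathbf{z}^{l}\right)$$
$$J_{\mathbf{y}}(\mathbf{z}) = \begin{bmatrix} \dfrac{\partial y_1}{\partial z_1} & \dfrac{\partial y_1}{\partial z_2} & \cdots & \dfrac{\partial y_1}{\partial z_{d_l}} \\ \dfrac{\partial y_2}{\partial z_1} & \dfrac{\partial y_2}{\partial z_2} & \cdots & \dfrac{\partial y_2}{\partial z_{d_l}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial y_{d_l}}{\partial z_1} & \dfrac{\partial y_{d_l}}{\partial z_2} & \cdots & \dfrac{\partial y_{d_l}}{\partial z_{d_l}} \end{bmatrix}$$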
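Vector Form (3) – The Jacobian
• Chain Rule for Jacobians
  • $\mathbf{y} = f(g(\mathbf{x}))$, $\mathbf{z} = g(\mathbf{x})$
  • $J_{\mathbf{y}}(\mathbf{x}) = J_{\mathbf{y}}(\mathbf{z})\, J_{\mathbf{z}}(\mathbf{x})$
• Backward Pass
  • $\nabla_{\mathbf{y}^{l}} \mathcal{L} = \nabla_{\mathbf{z}^{l+1}} \mathcal{L}\; \mathbf{W}^{l+1}$
  • $\nabla_{\mathbf{z}^{l}} \mathcal{L} = \nabla_{\mathbf{y}^{l}} \mathcal{L}\; J_{\mathbf{y}^{l}}\left(\mathbf{z}^{l}\right)$
  • $\nabla_{\mathbf{W}^{l}} \mathcal{L} = \mathbf{y}^{l-1}\, \nabla_{\mathbf{z}^{l}} \mathcal{L}$
• Corresponding scalar form:
$$\frac{\partial \mathcal{L}(\mathbf{W})}{\partial y_i^{l}} = \sum_j w_{ij}^{l+1}\, \frac{\partial \mathcal{L}(\mathbf{W})}{\partial z_j^{l+1}}, \qquad \frac{\partial \mathcal{L}(\mathbf{W})}{\partial z_i^{l}} = \frac{\partial y_i^{l}}{\partial z_i^{l}}\, \frac{\partial \mathcal{L}(\mathbf{W})}{\partial y_i^{l}}, \qquad \forall j: \ \frac{\partial \mathcal{L}(\mathbf{W})}{\partial w_{ij}^{l+1}} = y_i^{l}\, \frac{\partial \mathcal{L}(\mathbf{W})}{\partial z_j^{l+1}}$$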
In Summary - The Forward Pass
• Set $\mathbf{y}^{0} = \mathbf{x}$
• For layer $l = 1$ to $L$:
  • Recursion:
$$\mathbf{z}^{l} = \mathbf{W}^{l}\mathbf{y}^{l-1} + \mathbf{b}^{l}, \qquad \mathbf{y}^{l} = f_l\left(\mathbf{z}^{l}\right)$$
• Output:
$$\hat{\mathbf{y}} = \mathbf{y}^{L}$$
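The forward recursion can be sketched with plain lists. The layer sizes, weights, biases and the choice of ReLU followed by an identity output are assumptions for the example; all intermediate $\mathbf{z}^l$ and $\mathbf{y}^l$ are kept because a backward pass would need them.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(x, layers):
    """layers: list of (W, b, f) triples, one per layer.
    Returns all pre-activations zs and all outputs ys, with ys[0] = x."""
    ys, zs = [list(x)], []
    for W, b, f in layers:
        z = [a + bi for a, bi in zip(matvec(W, ys[-1]), b)]   # z = W y_prev + b
        zs.append(z)
        ys.append([f(v) for v in z])                          # y = f(z), elementwise
    return zs, ys

relu = lambda v: max(0.0, v)
identity = lambda v: v

# Hypothetical 2-2-1 network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0], relu),
    ([[1.0, 2.0]], [0.5], identity),
]
zs, ys = forward([2.0, 1.0], layers)
y_hat = ys[-1]
```

Here elementwise activations are assumed; a vector activation such as softmax would replace the list comprehension with a single call on the whole vector.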
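In Summary - The Backward Pass
• Set $\mathbf{y}^{0} = \mathbf{x}$, $\mathbf{y}^{L} = \hat{\mathbf{y}}$
• Initialize: Compute $\nabla_{\mathbf{y}^{L}} \mathcal{L} = \nabla_{\hat{\mathbf{y}}} \mathcal{L}$
• For layer $l = L$ to $1$:
  • Compute $J_{\mathbf{y}^{l}}\left(\mathbf{z}^{l}\right)$
    • Will require intermediate values computed in the forward pass
  • Recursion:
$$\nabla_{\mathbf{z}^{l}} \mathcal{L} = \nabla_{\mathbf{y}^{l}} \mathcal{L}\; J_{\mathbf{y}^{l}}\left(\mathbf{z}^{l}\right), \qquad \nabla_{\mathbf{y}^{l-1}} \mathcal{L} = \nabla_{\mathbf{z}^{l}} \mathcal{L}\; \mathbf{W}^{l}$$
  • Gradient Computation:
$$\nabla_{\mathbf{W}^{l}} \mathcal{L} = \mathbf{y}^{l-1}\, \nabla_{\mathbf{z}^{l}} \mathcal{L}, \qquad \nabla_{\mathbf{b}^{l}} \mathcal{L} = \nabla_{\mathbf{z}^{l}} \mathcal{L}$$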
Summary
• Feedforward Networks
  • Architecture
  • Activation Functions
• Training Feedforward Networks
  • Gradient Descent
  • Automatic Differentiation
  • Scalar and Vector Mode
  • Backpropagation algorithm