
Artificial Neural Networks
CS618 - Fall 2019

Narayanan C Krishnan, [email protected]

Outline
• Feedforward Networks
  • Architecture
  • Activation Functions
• Training Feedforward Networks
  • Gradient Descent
  • Automatic Differentiation
  • Scalar and Vector Mode
  • Backpropagation algorithm
• Heuristics for Training Feedforward Networks

Feedforward Networks CS618 - Artificial Neural Networks 2

Acknowledgements
- Content on activation functions has been adapted from the excellent material of CS231n (Stanford)
- Training the network has been adapted from the class material of 11-785 offered by Bhiksha Raj (CMU)
- Automatic differentiation has been adapted from the STAT 157 course by Alex Smola and Mu Li (Berkeley)

Feedforward Networks CS618 - Artificial Neural Networks 3

Recap
• Perceptron
• Perceptron Update Rule
• Multilayer Perceptron
• An MLP is a Universal Function Approximator

Feedforward Networks CS618 - Artificial Neural Networks 4

[Perceptron diagram: inputs x_0 = 1, x_1, …, x_n with weights w_0, w_1, …, w_n; net = Σ_{i=0}^{n} w_i x_i; output o = 1 if Σ_{i=0}^{n} w_i x_i > 0, and −1 otherwise.]

Machine Learning Model Paradigm
• Generic Architecture
• Representation
  • How would you like to characterize what is being learned?
• Evaluation
  • How would you like to measure the goodness of what is being learned?
• Optimization
  • Given the evaluation and characterization, find the optimum representation.
• Discuss this in the context of an MLP

Feedforward Networks CS618 - Artificial Neural Networks 5

Linear Regression and Using Nonlinear Basis Functions (1)
• A linear combination of fixed nonlinear basis functions of the input variables:
  y = f( w_0 + Σ_{j=1}^{M} w_j φ_j(x) )
• Regression: f(·) is the identity function
• Classification: f(·) is a nonlinear activation

Feedforward Networks CS618 - Artificial Neural Networks 6

[Figure 3.1 from Bishop, Pattern Recognition and Machine Learning: examples of basis functions — polynomials (left), Gaussians (centre), and sigmoidal functions (right).]

Linear Regression and Using Nonlinear Basis Functions (2)
• The nonlinear basis functions were predefined
• Extend this such that
  • φ_j depends on parameters w_j
  • The parameters w_j should be learned
• Feedforward networks are one method to construct these parametric nonlinear basis functions.

Feedforward Networks CS618 - Artificial Neural Networks 7

Basic Feedforward Network Model (1)
• Key difference from the Multilayer Perceptron: a nonlinear activation function

Feedforward Networks CS618 - Artificial Neural Networks 8

[Sigmoid unit diagram: inputs x_0 = 1, x_1, …, x_n with weights w_0, w_1, …, w_n; net = Σ_{i=0}^{n} w_i x_i; output o = σ(net) = 1 / (1 + e^{−net}).]

Basic Feedforward Network Model (2)
• Input: x
• Connection weights between the input and hidden layer: w_{jh}^{1}
• Output of the h-th node in the hidden layer: z_h
  z_h = f_1( Σ_{j=0}^{d} w_{jh}^{1} x_j )
• f_1(·): activation function
• z_h: activation at the h-th node

Feedforward Networks CS618 - Artificial Neural Networks 9

[Diagram: two-layer network with inputs x_0 = 1, x_1, …, x_d, hidden units z_0 = 1, z_1, …, z_m (weights w_{jh}^{1}), and outputs o_1, …, o_K (weights w_{hk}^{2}).]

Basic Feedforward Network Model (3)
• Connection weights between the hidden and output layer: w_{hk}^{2}
• Output of the k-th output-layer node: o_k
  o_k = f_2( Σ_{h=0}^{m} w_{hk}^{2} z_h )
• f_2(·): activation of the output layer
  • Identity for a regression task

Feedforward Networks CS618 - Artificial Neural Networks 10


Example: 2-Layer Feedforward Network
  o_k = f_2( Σ_{h=0}^{m} w_{hk}^{2} f_1( Σ_{j=0}^{d} w_{jh}^{1} x_j ) )

Feedforward Networks CS618 - Artificial Neural Networks 11
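For concreteness, here is a minimal NumPy sketch of the two-layer computation above. The sizes (d = 3 inputs, m = 4 hidden units, K = 2 outputs), the tanh hidden activation, and the identity output activation are illustrative choices, not taken from the slides.

```python
import numpy as np

def two_layer_forward(x, W1, W2, f1=np.tanh, f2=lambda a: a):
    """o_k = f2( sum_h W2[k,h] * f1( sum_j W1[h,j] * x_j ) ), with x0 = z0 = 1 as bias units."""
    x_aug = np.concatenate(([1.0], x))   # prepend x0 = 1
    z = f1(W1 @ x_aug)                   # hidden activations z_1 .. z_m
    z_aug = np.concatenate(([1.0], z))   # prepend z0 = 1
    return f2(W2 @ z_aug)                # outputs o_1 .. o_K

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3 + 1))         # rows: hidden units h, columns: x0 .. x_d
W2 = rng.normal(size=(2, 4 + 1))         # rows: outputs k, columns: z0 .. z_m
print(two_layer_forward(rng.normal(size=3), W1, W2))
```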


Linear Feedforward Network
• If the activations of all the hidden units in a network are taken to be linear,
  f_h(z) = a_h z + b_h
• There is always an equivalent network without the hidden units
  • Rather, the hidden layers can be collapsed into a single linear transformation
  • The composition of successive linear transformations is itself a linear transformation
• However, if the number of hidden units is less than the number of input or output units, then there is some loss of information
• Linear feedforward networks are of little interest to us

Feedforward Networks CS618 - Artificial Neural Networks 12

Activation Functions - Sigmoid

• f(x) = 1 / (1 + e^{−x})
• Large negative and positive inputs become 0 and 1 respectively
• Advantages
  • Continuous and differentiable
  • f′(x) = f(x)(1 − f(x))

Feedforward Networks CS618 - Artificial Neural Networks 13

Activation Functions – Sigmoid Drawbacks
• Saturation – a problem during the learning/training phase
  • The gradient at the tails is 0
• Outputs are not zero-centered
  • They are always positive
  • This causes issues during the gradient update (all-positive or all-negative updates)

Feedforward Networks CS618 - Artificial Neural Networks 14
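A small illustrative sketch (plain NumPy, not from the slides) of the sigmoid above and of tanh from the next slide, together with the derivatives f′(x) = f(x)(1 − f(x)) and f′(x) = 1 − f²(x); evaluating the derivatives at large |x| shows the saturation problem numerically.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # f'(x) = f(x)(1 - f(x)); always positive, at most 0.25

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # f'(x) = 1 - f(x)^2; tanh output is zero-centered

for x in (0.0, 5.0, -10.0):           # at the tails both gradients are essentially 0 (saturation)
    print(x, sigmoid(x), sigmoid_grad(x), tanh_grad(x))
```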

Activation Functions - tanh
• f(x) = 2σ(2x) − 1
• Large negative and positive inputs become −1 and 1 respectively
• Advantages
  • Continuous and differentiable
  • f′(x) = 1 − f²(x)
  • Zero-centered output

Feedforward Networks CS618 - Artificial Neural Networks 15

Activation Functions - ReLU
• Rectified Linear Unit: f(x) = max(0, x)
• The activation is simply thresholded at 0
• Advantages
  • Computationally more efficient
  • Non-saturating
  • Sparse activation

Feedforward Networks CS618 - Artificial Neural Networks 16

Activation Functions - ReLU
• Disadvantages
  • Not zero-centered
  • Unbounded
  • Dying ReLU – units pushed into states in which they become inactive for all inputs!
    • Observed when the learning rate is too high
  • Not differentiable at 0; instead use subgradients

Feedforward Networks CS618 - Artificial Neural Networks 17

Diversion: Subgradients (1)
• A subgradient of a function f: ℝ^d → ℝ at x is any vector g ∈ ℝ^d such that
  f(y) ≥ f(x) + g^T (y − x)  for all y ∈ D(f)
• It is a generalization of the gradient to functions that need not be differentiable
• If f is differentiable, then there is a unique subgradient: ∀x ∈ D(f), g = ∇f(x)
• Example

Feedforward Networks CS618 - Artificial Neural Networks 18

Diversion: Subgradients (2)

Feedforward Networks CS618 - Artificial Neural Networks 19

• Can use any subgradient
• At the differentiable points on the curve, this is the same as the gradient
• Typically, we will use
  f′(z) = 0 if z < 0,  1 if z ≥ 0

Activation Functions – Leaky ReLU
• f(x) = 𝕀(x < 0) αx + 𝕀(x ≥ 0) x
  • α is a small constant
• Advantage
  • Reduces the dying-unit problem
• Results are not consistent

Feedforward Networks CS618 - Artificial Neural Networks 20

[Figure 1 from He et al. (2015), the PReLU paper: ReLU vs. PReLU. For PReLU, the coefficient of the negative part is not constant and is adaptively learned.]
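A brief sketch of ReLU and Leaky ReLU with the subgradients typically used in practice (1 taken at x = 0, and α on the negative side); the value α = 0.01 and the test points are illustrative. PReLU, referenced in the figure caption above, has the same form but treats the negative-side slope as a parameter learned during training.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_subgrad(x):
    # Any value in [0, 1] is a valid subgradient at x = 0; we pick 1 here.
    return (x >= 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)   # f(x) = alpha*x for x < 0, x otherwise

def leaky_relu_subgrad(x, alpha=0.01):
    return np.where(x < 0, alpha, 1.0)     # the negative side keeps a small nonzero slope

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), relu_subgrad(x))
print(leaky_relu(x), leaky_relu_subgrad(x))
```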

Activation Functions - Summary
• Never use sigmoid; prefer tanh over sigmoid
• A better choice would be ReLU or Leaky ReLU

Feedforward Networks CS618 - Artificial Neural Networks 21

Skip Connections
• Generic feedforward networks can have skip connections
  • Connections that skip a layer
• In principle, networks with sigmoidal hidden units can always mimic skip connections

[Figure 5.2 from Bishop, Pattern Recognition and Machine Learning: example of a neural network having a general feed-forward topology, with inputs x_1, x_2, hidden units z_1, z_2, z_3, and outputs y_1, y_2; each hidden and output unit has an associated bias parameter (omitted for clarity).]

Feedforward Networks CS618 - Artificial Neural Networks 22

Weight Space Symmetries (1)
• Multiple distinct choices for the weight vector can all give rise to the same mapping function from the inputs to the outputs

Feedforward Networks CS618 - Artificial Neural Networks 23

[Diagram: two networks that are identical except that hidden units z_1 and z_h, together with their incoming and outgoing weights, are interchanged; both implement the same mapping.]

Weight Space Symmetries (2)
• Multiple distinct choices for the weight vector can all give rise to the same mapping function from the inputs to the outputs
• Even though the mapping is the same, the weight vectors are different
• Given H hidden units, there are H! different orderings

Feedforward Networks CS618 - Artificial Neural Networks 24
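The H! symmetry can be checked numerically: permuting the hidden units of a one-hidden-layer network (the rows of the first weight matrix together with the matching columns of the second) leaves the input-output mapping unchanged. A minimal sketch with illustrative sizes and tanh hidden units:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K = 3, 4, 2                      # inputs, hidden units, outputs (illustrative)
W1 = rng.normal(size=(m, d))           # input-to-hidden weights
W2 = rng.normal(size=(K, m))           # hidden-to-output weights
x = rng.normal(size=d)

perm = rng.permutation(m)              # reorder the hidden units
W1_p, W2_p = W1[perm, :], W2[:, perm]  # permute rows of W1 and columns of W2 together

out = W2 @ np.tanh(W1 @ x)
out_p = W2_p @ np.tanh(W1_p @ x)
print(np.allclose(out, out_p))         # True: different weight vectors, identical mapping
```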

Machine Learning Model Paradigm
• Generic Architecture
• Representation
  • How would you like to characterize what is being learned?
• Evaluation
  • How would you like to measure the goodness of what is being learned?
• Optimization
  • Given the evaluation and characterization, find the optimum representation.

Feedforward Networks CS618 - Artificial Neural Networks 25

Problem Statement
• Given a training set of input and output pairs {(x_j, y_j)}_{j=1}^{N}
• We have parameterized the function f: ℝ^d → 𝒴 as the neural network.
  • The parameters are w_{ij}^l – the weights of the neural network
• The next step is to define an evaluation function ℒ in terms of the parameters of the network:
  ℒ(W) = ⋯

Feedforward Networks CS618 - Artificial Neural Networks 26

Representing the Input

• Vectors of numbers
  • (or may even be just a scalar, if the input layer is of size 1)
  • E.g. a vector of pixel values
  • E.g. a vector of speech features
  • E.g. a real-valued vector representing text
  • Other real-valued vectors

Feedforward Networks CS618 - Artificial Neural Networks 27

[Diagram: input layer → hidden layers → output layer.]

Representing the Output (Regression)

• If the desired output is real-valued
  • Scalar output: a single output neuron o without any activation
  • Vector output: as many output neurons as the dimension of the desired output, o = [o_1, o_2, …, o_K]

Feedforward Networks CS618 - Artificial Neural Networks 28

Representing the Output (Classification) (1)
• If the desired output is binary (0/1 or −1/+1)
  • Sigmoid output: a single output neuron o with sigmoid activation
  • Viewed as the probability P(Y = 1 | X) of class value 1
    • Reflecting the fact that, for real data, a feature value x may in general occur in both classes, but with different probabilities
  • Is differentiable

Feedforward Networks CS618 - Artificial Neural Networks 29

Representing the Output (Classification) (2)
• Consider a network that must distinguish the input as one of the following classes – dog, cat, cow, or horse
• We can represent this set as the vector [dog cat cow horse]^T
• For inputs of each of the four classes the desired output is:
  • dog: [1 0 0 0]^T, cat: [0 1 0 0]^T, cow: [0 0 1 0]^T, horse: [0 0 0 1]^T
• For an input of any class, we will have a four-dimensional vector output with three zeros and a single 1 at the position of that class
• This is a one-hot vector representation

Feedforward Networks CS618 - Artificial Neural Networks 30

Representing the Output (Classification) (3)
• For a multi-class classifier with K classes, the one-hot representation will have K binary outputs
• The neural network's output too must ideally be binary (K − 1 zeros and a single 1 in the right place)
• More realistically, it will be a probability vector of K dimensions
  • K probability values that sum to 1

Feedforward Networks CS618 - Artificial Neural Networks 31

Representing the Output (Classification) (4)
• Softmax vector activation is often used at the output of multi-class classifier nets:
  z_k^L = w_k^L · y^{L−1},    ŷ_k = exp(z_k^L) / Σ_j exp(z_j^L)
• This can be viewed as the probability ŷ_k = P(class = k | x)

Feedforward Networks CS618 - Artificial Neural Networks 32
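A minimal sketch of the softmax activation above; subtracting max(z) before exponentiating is a standard numerical-stability trick (an implementation detail, not from the slides) and does not change the result. The example logits are illustrative.

```python
import numpy as np

def softmax(z):
    """softmax(z)_k = exp(z_k) / sum_j exp(z_j), computed with a max shift for stability."""
    z = z - np.max(z)              # shifting by a constant leaves the softmax unchanged
    e = np.exp(z)
    return e / e.sum()

y_hat = softmax(np.array([2.0, 1.0, 0.1, -1.0]))
print(y_hat, y_hat.sum())          # probabilities over the K classes, summing to 1
```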

Representing the Output (Classification) (5)
• Consider a network that must identify all/some of the objects present in an image
• We can again represent this set as the vector [dog cat cow horse]^T
• Each class is however treated independently (no longer a one-hot encoding)
• Output activations will be individual sigmoids (not softmax)

Feedforward Networks CS618 - Artificial Neural Networks 33

Defining the Evaluation Function
• Given a training set of input and output pairs {(x_j, y_j)}_{j=1}^{N}
• The feedforward pass of the network provides the outputs {ŷ_j}_{j=1}^{N}
• We can now define the evaluation function ℒ(W)

Feedforward Networks CS618 - Artificial Neural Networks 34

Evaluation Functions - L2 Divergence
• For real-valued output vectors, the L2 divergence is popular:
  ℓ(W; x, y) = ½ ‖ŷ − y‖² = ½ Σ_{i=1}^{K} (ŷ_i − y_i)²
• Squared Euclidean distance between the actual and desired outputs
• Note: this is differentiable
  dℓ(W; x, y)/dŷ_i = ŷ_i − y_i
  ∇_ŷ ℓ(W; x, y) = [ŷ_1 − y_1, ŷ_2 − y_2, …, ŷ_K − y_K]^T

Feedforward Networks CS618 - Artificial Neural Networks 35
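A short sketch of the L2 divergence and its gradient with respect to the network output, matching ℓ(W; x, y) = ½‖ŷ − y‖² and ∇_ŷℓ = ŷ − y; the example vectors are illustrative.

```python
import numpy as np

def l2_loss(y_hat, y):
    """l(W; x, y) = 1/2 * ||y_hat - y||^2."""
    return 0.5 * np.sum((y_hat - y) ** 2)

def l2_loss_grad(y_hat, y):
    """Gradient w.r.t. the network output: [y_hat_1 - y_1, ..., y_hat_K - y_K]."""
    return y_hat - y

y_hat = np.array([0.8, -0.3])
y = np.array([1.0, 0.0])
print(l2_loss(y_hat, y), l2_loss_grad(y_hat, y))
```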

Evaluation Functions – Cross-Entropy Error
• For a binary classifier, the desired output y ∈ {0, 1}
  • This can be interpreted as the true probability distribution [y, 1 − y]
• The predicted probability distribution is [ŷ, 1 − ŷ]
• Then, we can estimate the Kullback-Leibler (KL) divergence between the two distributions:
  ℓ(W; x, y) = −y log ŷ − (1 − y) log(1 − ŷ)
• The minimum occurs when ŷ = y
• Derivative:
  dℓ(W; x, y)/dŷ = −1/ŷ if y = 1,  1/(1 − ŷ) if y = 0

Feedforward Networks CS618 - Artificial Neural Networks 36
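A sketch of the binary cross-entropy above; the clipping constant eps is an implementation detail (not from the slides) that avoids log(0). The general expression −y/ŷ + (1 − y)/(1 − ŷ) reduces to the two cases on the slide for y = 1 and y = 0.

```python
import numpy as np

def bce_loss(y_hat, y, eps=1e-12):
    """l(W; x, y) = -y log(y_hat) - (1 - y) log(1 - y_hat)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)

def bce_loss_grad(y_hat, y, eps=1e-12):
    """dl/dy_hat = -1/y_hat when y = 1, and 1/(1 - y_hat) when y = 0."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -y / y_hat + (1.0 - y) / (1.0 - y_hat)

print(bce_loss(0.9, 1), bce_loss_grad(0.9, 1))   # confident and correct: small loss
print(bce_loss(0.9, 0), bce_loss_grad(0.9, 0))   # confident and wrong: large loss
```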

Evaluation Functions – Softmax Loss
• Desired output y is a one-hot vector [0 0 … 1 … 0 0 0] with the 1 in the position c corresponding to the true class
• Actual output will be a probability distribution ŷ
• The KL divergence between the desired one-hot output and the actual output:
  ℓ(W; x, y) = −Σ_k y_k log ŷ_k = −log ŷ_c
• Derivative:
  dℓ(W; x, y)/dŷ_k = −1/ŷ_c for the true-class component, 0 for the remaining components
  ∇_ŷ ℓ(W; x, y) = [0 0 … −1/ŷ_c … 0 0]

Feedforward Networks CS618 - Artificial Neural Networks 37
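A sketch of the softmax (one-hot cross-entropy) loss and its gradient with respect to ŷ, as given above; the target and predicted vectors are illustrative. When this loss is combined with a softmax output layer, the gradient with respect to the pre-activations z simplifies to ŷ − y, which is essentially the vector-activation exercise posed later.

```python
import numpy as np

def softmax_loss(y_hat, y):
    """l(W; x, y) = -sum_k y_k log(y_hat_k) = -log(y_hat_c) for a one-hot target y."""
    c = np.argmax(y)                      # index of the true class
    return -np.log(y_hat[c])

def softmax_loss_grad(y_hat, y):
    """Gradient w.r.t. y_hat: -1/y_hat_c at the true class, 0 elsewhere."""
    g = np.zeros_like(y_hat)
    c = np.argmax(y)
    g[c] = -1.0 / y_hat[c]
    return g

y = np.array([0.0, 1.0, 0.0])             # one-hot target, class 2 of 3
y_hat = np.array([0.2, 0.7, 0.1])          # softmax output of the network
print(softmax_loss(y_hat, y), softmax_loss_grad(y_hat, y))
```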

So Far
• Representation
  • How would you like to characterize what is being learned? – W, the neural network parameters
• Evaluation
  • How would you like to measure the goodness of what is being learned? – ℒ(W), the evaluation function
• Optimization
  • Given the evaluation and characterization, find the optimum representation
  • Solution – Gradient Descent!

Feedforward Networks CS618 - Artificial Neural Networks 38

Training Feedforward Networks Through Gradient Descent
• Given a training set of input and output pairs {(x_j, y_j)}_{j=1}^{N}
• We have parameterized the function f: ℝ^d → 𝒴 as the neural network.
  • The parameters are w_{ij}^l – the weights of the neural network
• An evaluation function ℒ, in terms of the parameters of the network, measures the goodness of the function
• Minimize ℒ(W) with respect to w_{ij}^l

Feedforward Networks CS618 - Artificial Neural Networks 39

Quick Recap – Gradient Descent
• In order to minimize any function f(x) w.r.t. x
• Initialize:
  • x^0
  • t = 0
• While |f(x^{t+1}) − f(x^t)| > ε:
  • x^{t+1} = x^t − η_t ∇f(x^t)
  • t = t + 1

Feedforward Networks CS618 - Artificial Neural Networks 40
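A direct transcription of the recap into code; the stopping tolerance, step size, iteration cap, and the quadratic example are illustrative choices.

```python
import numpy as np

def gradient_descent(f, grad_f, x0, eta=0.1, eps=1e-8, max_iter=10_000):
    """Minimize f by x_{t+1} = x_t - eta * grad_f(x_t) until |f(x_{t+1}) - f(x_t)| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - eta * grad_f(x)
        if abs(f(x_new) - f(x)) <= eps:
            return x_new
        x = x_new
    return x

# Example: minimize f(x) = ||x - a||^2, whose gradient is 2(x - a); the minimizer is a.
a = np.array([3.0, -1.0])
print(gradient_descent(lambda x: np.sum((x - a) ** 2),
                       lambda x: 2.0 * (x - a),
                       x0=np.zeros(2)))
```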

Training Feedforward Networks Through Gradient Descent
• Total training error: ℒ(W) = (1/N) Σ_{j=1}^{N} ℓ(W; x_j, y_j)
• Initialize all weights w_{ij}^l
• Do:
  • For every layer l, for all i, j, update:
    w_{ij}^l ← w_{ij}^l − η dℒ(W)/dw_{ij}^l
• Until ℒ(W) has converged

Feedforward Networks CS618 - Artificial Neural Networks 41

Training Feedforward Networks Through Gradient Descent
• Total training error: ℒ(W) = (1/N) Σ_{j=1}^{N} ℓ(W; x_j, y_j)
• Initialize all weights w_{ij}^l
• Do:
  • For every layer l, for all i, j, update:
    w_{ij}^l ← w_{ij}^l − η dℒ(W)/dw_{ij}^l
• Until ℒ(W) has converged
• So how do we compute dℒ(W)/dw_{ij}^l?

Feedforward Networks CS618 - Artificial Neural Networks 42

Key Principle – Chain Rule of Differentiation

If y = f(g_1(x), g_2(x), …, g_M(x)), then
  dy/dx = (∂f/∂g_1(x)) dg_1(x)/dx + (∂f/∂g_2(x)) dg_2(x)/dx + ⋯ + (∂f/∂g_M(x)) dg_M(x)/dx

Feedforward Networks CS618 - Artificial Neural Networks 43
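A small numerical check of this chain rule for an illustrative choice f(g_1, g_2) = g_1 · g_2 with g_1(x) = x² and g_2(x) = sin x (none of these functions come from the slides); the analytic derivative from the chain rule agrees with a central finite difference.

```python
import numpy as np

# y = f(g1(x), g2(x)) with f(u, v) = u * v, g1(x) = x**2, g2(x) = sin(x).
def y(x):
    return (x ** 2) * np.sin(x)

def dy_dx_chain(x):
    # dy/dx = (df/dg1) * dg1/dx + (df/dg2) * dg2/dx
    g1, g2 = x ** 2, np.sin(x)
    df_dg1, df_dg2 = g2, g1                       # partials of f(u, v) = u * v
    return df_dg1 * 2 * x + df_dg2 * np.cos(x)

x0, h = 1.3, 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)       # central finite difference
print(dy_dx_chain(x0), numeric)                   # the two values agree closely
```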

Generalization to Matrices

Feedforward Networks CS618 - Artificial Neural Networks 44

Digression – Automatic Differentiation (1)
• Automatic Differentiation
  • No need for an explicit form of the derivative
    • Difficult to compute for extremely complex functions
  • Better than numerical differentiation
    • Which gives only an estimate and is not exact

Feedforward Networks CS618 - Artificial Neural Networks 45

Digression – Automatic Differentiation (2): Computation Graph

Feedforward Networks CS618 - Artificial Neural Networks 46

Digression – Automatic Differentiation (3): Two Modes

Feedforward Networks CS618 - Artificial Neural Networks 47

Digression – Automatic Differentiation (4): Reverse Accumulation

Feedforward Networks CS618 - Artificial Neural Networks 48

Digression – Automatic Differentiation (5): Reverse Accumulation

Feedforward Networks CS618 - Artificial Neural Networks 49

Digression – Automatic Differentiation (6): Reverse Accumulation

Feedforward Networks CS618 - Artificial Neural Networks 50

Digression – Automatic Differentiation (7): Reverse Accumulation

Feedforward Networks CS618 - Artificial Neural Networks 51

Digression – Automatic Differentiation (8): Reverse Accumulation

Feedforward Networks CS618 - Artificial Neural Networks 52

• Build a computation graph
• Forward: evaluate the graph and store the intermediate results
• Backward: evaluate the graph in reversed order
  • Eliminate paths that are not needed
• Computational complexity: similar to the forward pass
• Memory complexity: high, as the intermediate results need to be stored
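A toy sketch of reverse accumulation on a scalar computation graph (the class and function names are invented for illustration): the forward calls build the graph and store values, and backward() walks it in reverse, applying the chain rule and accumulating gradients. The naive recursion revisits shared nodes once per path, which is fine for this tiny example; real systems process nodes in reverse topological order.

```python
import math

class Node:
    """A scalar node in a computation graph; stores its value, gradient, and local chain-rule terms."""
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents  # parents: (node, local_grad) pairs

    def backward(self, upstream=1.0):
        self.grad += upstream
        for parent, local_grad in self.parents:      # chain rule, traversed in reverse
            parent.backward(upstream * local_grad)

def mul(a, b):
    return Node(a.value * b.value, parents=((a, b.value), (b, a.value)))

def sin(a):
    return Node(math.sin(a.value), parents=((a, math.cos(a.value)),))

# Forward: build the graph and store intermediate values; Backward: accumulate gradients.
x = Node(1.3)
out = mul(mul(x, x), sin(x))      # out = x^2 * sin(x)
out.backward()                    # reverse accumulation from the output
print(out.value, x.grad)          # x.grad = 2x*sin(x) + x^2*cos(x)
```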

Applying Forward and Backward Pass to a Feedforward Network

Feedforward Networks CS618 - Artificial Neural Networks 53

[Diagram: an L-layer feedforward network; the input y^0 = x passes through layers computing z^l and y^l = f_l(z^l) for l = 1, …, L, up to the output y^L = ŷ. A constant bias unit (1) feeds each layer.]

Generic Forward Pass (1)

Feedforward Networks CS618 - Artificial Neural Networks 54

y^0 = x

Generic Forward Pass (2)

Feedforward Networks CS618 - Artificial Neural Networks 55

z_j^1 = Σ_i w_{ij}^1 y_i^0

Generic Forward Pass (3)

Feedforward Networks CS618 - Artificial Neural Networks 56

z_j^1 = Σ_i w_{ij}^1 y_i^0,    y_j^1 = f_1(z_j^1)

Generic Forward Pass (4)

Feedforward Networks CS618 - Artificial Neural Networks 57

z_j^1 = Σ_i w_{ij}^1 y_i^0,    y_j^1 = f_1(z_j^1),    z_j^2 = Σ_i w_{ij}^2 y_i^1

Generic Forward Pass (5)

Feedforward Networks CS618 - Artificial Neural Networks 58

z_j^1 = Σ_i w_{ij}^1 y_i^0,    y_j^1 = f_1(z_j^1),    z_j^2 = Σ_i w_{ij}^2 y_i^1,    y_j^2 = f_2(z_j^2)

Generic Forward Pass (6)

Feedforward Networks CS618 - Artificial Neural Networks 59

z_j^1 = Σ_i w_{ij}^1 y_i^0,    y_j^1 = f_1(z_j^1),    z_j^2 = Σ_i w_{ij}^2 y_i^1,    y_j^2 = f_2(z_j^2),    z_j^3 = Σ_i w_{ij}^3 y_i^2

Generic Forward Pass (7)

Feedforward Networks CS618 - Artificial Neural Networks 60

z_j^1 = Σ_i w_{ij}^1 y_i^0,    y_j^1 = f_1(z_j^1),    z_j^2 = Σ_i w_{ij}^2 y_i^1,    y_j^2 = f_2(z_j^2),    z_j^3 = Σ_i w_{ij}^3 y_i^2,    y_j^3 = f_3(z_j^3),    ⋯

Generic Forward Pass (8)

Feedforward Networks CS618 - Artificial Neural Networks 61

⋯    y_j^{L−1} = f_{L−1}(z_j^{L−1}),    z_j^L = Σ_i w_{ij}^L y_i^{L−1},    y_j^L = f_L(z_j^L)

Generic Forward Pass (9)

Feedforward Networks CS618 - Artificial Neural Networks 62

In general, with y^0 = x:    z_j^l = Σ_i w_{ij}^l y_i^{l−1},    y_j^{l−1} = f_{l−1}(z_j^{l−1})

Backward Pass – Computing Derivatives (1)

Feedforward Networks CS618 - Artificial Neural Networks 63

All the intermediate values estimated in the forward pass must be stored – we will need them to compute the derivatives.

Backward Pass – Computing Derivatives (2)

Feedforward Networks CS618 - Artificial Neural Networks 64

[The loss ℒ(W) is now attached to the network output, comparing ŷ = y^L against the desired output y.]

Backward Pass – Computing Derivatives (3)

Feedforward Networks CS618 - Artificial Neural Networks 65

∂ℒ(W)/∂ŷ_j = ∂ℒ(W)/∂y_j^L

Backward Pass – Computing Derivatives (4)

Feedforward Networks CS618 - Artificial Neural Networks 66

∂ℒ(W)/∂z_1^L = (∂ℒ(W)/∂y_1^L)(∂y_1^L/∂z_1^L)    — ∂ℒ(W)/∂y_1^L has already been computed

Backward Pass – Computing Derivatives (5)

Feedforward Networks CS618 - Artificial Neural Networks 67

∂ℒ(W)/∂z_1^L = (∂ℒ(W)/∂y_1^L)(∂y_1^L/∂z_1^L)    — ∂y_1^L/∂z_1^L is the derivative of the activation function f_L at z_1^L

Backward Pass – Computing Derivatives (6)

Feedforward Networks CS618 - Artificial Neural Networks 68

∂ℒ(W)/∂z_j^L = (∂ℒ(W)/∂y_j^L)(∂y_j^L/∂z_j^L)

Backward Pass – Computing Derivatives (7)

Feedforward Networks CS618 - Artificial Neural Networks 69

∂ℒ(W)/∂w_{11}^L = (∂ℒ(W)/∂z_1^L)(∂z_1^L/∂w_{11}^L)    — ∂ℒ(W)/∂z_1^L was computed in the previous step

Backward Pass – Computing Derivatives (8)

Feedforward Networks CS618 - Artificial Neural Networks 70

∂ℒ(W)/∂w_{11}^L = (∂ℒ(W)/∂z_1^L)(∂z_1^L/∂w_{11}^L);    since z_1^L = w_{11}^L y_1^{L−1} + ⋯,    ∂z_1^L/∂w_{11}^L = y_1^{L−1}

Backward Pass – Computing Derivatives (9)

Feedforward Networks CS618 - Artificial Neural Networks 71

∂ℒ(W)/∂w_{ij}^L = (∂ℒ(W)/∂z_j^L)(∂z_j^L/∂w_{ij}^L) = y_i^{L−1} ∂ℒ(W)/∂z_j^L

Backward Pass – Computing Derivatives (10)

Feedforward Networks CS618 - Artificial Neural Networks 72

∂ℒ(W)/∂y_1^{L−1} = Σ_j (∂ℒ(W)/∂z_j^L)(∂z_j^L/∂y_1^{L−1})    — ∂ℒ(W)/∂z_j^L was computed a couple of steps earlier

Backward Pass – Computing Derivatives (11)

Feedforward Networks CS618 - Artificial Neural Networks 73

∂ℒ(W)/∂y_1^{L−1} = Σ_j (∂ℒ(W)/∂z_j^L)(∂z_j^L/∂y_1^{L−1});    since z_j^L = w_{1j}^L y_1^{L−1} + ⋯,    ∂z_j^L/∂y_1^{L−1} = w_{1j}^L

Backward Pass – Computing Derivatives (12)

Feedforward Networks CS618 - Artificial Neural Networks 74

∂ℒ(W)/∂y_1^{L−1} = Σ_j w_{1j}^L ∂ℒ(W)/∂z_j^L

Backward Pass – Computing Derivatives (13)

Feedforward Networks CS618 - Artificial Neural Networks 75

∂ℒ(W)/∂y_i^{L−1} = Σ_j w_{ij}^L ∂ℒ(W)/∂z_j^L

Backward Pass – Computing Derivatives (14)

Feedforward Networks CS618 - Artificial Neural Networks 76

∂ℒ(W)/∂z_i^{L−1} = (∂y_i^{L−1}/∂z_i^{L−1}) ∂ℒ(W)/∂y_i^{L−1}

Backward Pass – Computing Derivatives (15)

Feedforward Networks CS618 - Artificial Neural Networks 77

Proceeding backward through the earlier layers in the same way:    ∂ℒ(W)/∂y_i^l = Σ_j w_{ij}^{l+1} ∂ℒ(W)/∂z_j^{l+1}

Backward Pass – Computing Derivatives (16)

Feedforward Networks CS618 - Artificial Neural Networks 78

∂ℒ(W)/∂z_i^l = (∂y_i^l/∂z_i^l) ∂ℒ(W)/∂y_i^l

Backward Pass – Computing Derivatives (17)

Feedforward Networks CS618 - Artificial Neural Networks 79

∂ℒ(W)/∂y_i^1 = Σ_j w_{ij}^2 ∂ℒ(W)/∂z_j^2

Backward Pass – Computing Derivatives (18)

Feedforward Networks CS618 - Artificial Neural Networks 80

∂ℒ(W)/∂z_i^1 = (∂y_i^1/∂z_i^1) ∂ℒ(W)/∂y_i^1

Backward Pass – Computing Derivatives (19)

Feedforward Networks CS618 - Artificial Neural Networks 81

∂ℒ(W)/∂w_{ij}^1 = (∂z_j^1/∂w_{ij}^1) ∂ℒ(W)/∂z_j^1 = y_i^0 ∂ℒ(W)/∂z_j^1 = x_i ∂ℒ(W)/∂z_j^1

Backward Pass – Summary
• Compute ℒ(W)
• Initialize the gradient with respect to the network output:
  ∂ℒ(W)/∂ŷ_j = ∂ℒ(W)/∂y_j^L,    ∂ℒ(W)/∂z_j^L = (∂y_j^L/∂z_j^L) ∂ℒ(W)/∂y_j^L
• For l = L − 1, …, 0:
  • For i = 1 : width of the l-th layer:
    ∂ℒ(W)/∂y_i^l = Σ_j w_{ij}^{l+1} ∂ℒ(W)/∂z_j^{l+1}
    ∂ℒ(W)/∂z_i^l = (∂y_i^l/∂z_i^l) ∂ℒ(W)/∂y_i^l
    ∀j:  ∂ℒ(W)/∂w_{ij}^{l+1} = y_i^l ∂ℒ(W)/∂z_j^{l+1}

Feedforward Networks CS618 - Artificial Neural Networks 82

Special Cases – Vector Activations
• The argument to the activation function is a vector (instead of a scalar)
  • Each z_i^l influences all of y_1^l, …, y_m^l
• The number of outputs of a vector activation need not be the same as the number of inputs
• Thus
  ∂ℒ/∂z_i^l = Σ_j (∂ℒ/∂y_j^l)(∂y_j^l/∂z_i^l)
• Exercise – work this out for softmax

Feedforward Networks CS618 - Artificial Neural Networks 83


Special Case – Multiplicative Units
• Some networks have multiplicative combinations of units
  • In contrast to additive combinations
  o_i^l = y_j^{l−1} y_k^{l−1}
• Backward pass – gradient computation:
  ∂ℒ/∂y_j^{l−1} = (∂o_i^l/∂y_j^{l−1}) (∂ℒ/∂o_i^l) = y_k^{l−1} ∂ℒ/∂o_i^l

Feedforward Networks CS618 - Artificial Neural Networks 84


Converting the Process Using Vector Operations
• For layered networks it is generally simpler to think of the process in terms of vector operations
  • Simpler arithmetic
  • Fast matrix libraries make the operations much faster
• We can restate the entire process in vector terms
  • (On the following slides – please read)
  • This is what is actually used in any real system

Feedforward Networks CS618 - Artificial Neural Networks 85

Vector Form (1)

Feedforward Networks CS618 - Artificial Neural Networks 86

x = [x_1, x_2, …, x_d]^T,    z^l = [z_1^l, z_2^l, …, z_{n_l}^l]^T,    y^l = [y_1^l, y_2^l, …, y_{n_l}^l]^T

W^l is the n_l × n_{l−1} weight matrix whose (j, i) entry w_{ij}^l connects unit i of layer l−1 to unit j of layer l

  z^l = W^l y^{l−1},    y^l = f_l(z^l)

[Diagram: inputs x_1, …, x_d connected through the weights w_{11}^1, …, w_{d n_1}^1 to z_1^1, …, z_{n_1}^1 and activations y_1^1, …, y_{n_1}^1.]

Forward Pass:  ŷ = y^L = f_L( W^L f_{L−1}( W^{L−1} f_{L−2}( ⋯ W^2 f_1( W^1 x ) ) ) )

Vector Form (2) – The Jacobian

• The derivative of a vector function w.r.t. vector input is called a Jacobian

• It is the matrix of partial derivatives

• z^l = W^l y^{l−1} implies J_{y^{l−1}}(z^l) = W^l

Feedforward Networks CS618 - Artificial Neural Networks 87

z^l = [z_1^l, z_2^l, …, z_{n_l}^l]^T,    y^l = [y_1^l, y_2^l, …, y_{n_l}^l]^T = f_l(z^l)

J_y(z) = [ ∂y_i/∂z_j ]  — the matrix whose (i, j) entry is ∂y_i/∂z_j

Vector Form (3) – The Jacobian
• Chain rule for Jacobians
  • If y = f(g(x)) with z = g(x), then J_y(x) = J_y(z) J_z(x)
• Backward pass
  • ∇_{y^l}ℒ = ∇_{z^{l+1}}ℒ · W^{l+1}
  • ∇_{z^l}ℒ = ∇_{y^l}ℒ · J_{y^l}(z^l)
  • ∇_{W^l}ℒ = y^{l−1} ∇_{z^l}ℒ

Feedforward Networks CS618 - Artificial Neural Networks 88

(Scalar form: ∂ℒ(W)/∂y_i^l = Σ_j w_{ij}^{l+1} ∂ℒ(W)/∂z_j^{l+1},    ∂ℒ(W)/∂z_i^l = (∂y_i^l/∂z_i^l) ∂ℒ(W)/∂y_i^l,    ∀j: ∂ℒ(W)/∂w_{ij}^{l+1} = y_i^l ∂ℒ(W)/∂z_j^{l+1})

In Summary - The Forward Pass
• Set y^0 = x
• For layer l = 1 to L:
  • Recursion:
    z^l = W^l y^{l−1} + b^l
    y^l = f_l(z^l)
• Output: ŷ = y^L

Feedforward Networks CS618 - Artificial Neural Networks 89
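A minimal vectorized sketch of this forward-pass summary; the layer sizes and activations are illustrative, and all intermediate y^l and z^l are returned because the backward pass (next slide) needs them.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """y^0 = x; for l = 1..L: z^l = W^l y^{l-1} + b^l, y^l = f_l(z^l); returns all y, z."""
    ys, zs = [x], []
    for W, b, f in zip(weights, biases, activations):
        z = W @ ys[-1] + b
        zs.append(z)
        ys.append(f(z))
    return ys, zs                      # ys[-1] is the network output y_hat

# Illustrative network with two tanh hidden layers and an identity (regression) output: 3 -> 4 -> 4 -> 2.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
activations = [np.tanh, np.tanh, lambda z: z]
ys, zs = forward(rng.normal(size=3), weights, biases, activations)
print(ys[-1])
```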

In Summary - The Backward Pass
• Set y^0 = x, y^L = ŷ
• Initialize: compute ∇_{ŷ}ℒ = ∇_{y^L}ℒ
• For layer l = L to 1:
  • Compute J_{y^l}(z^l)
    • Will require intermediate values computed in the forward pass
  • Recursion:
    ∇_{z^l}ℒ = ∇_{y^l}ℒ J_{y^l}(z^l)
    ∇_{y^{l−1}}ℒ = ∇_{z^l}ℒ W^l
  • Gradient computation:
    ∇_{W^l}ℒ = y^{l−1} ∇_{z^l}ℒ
    ∇_{b^l}ℒ = ∇_{z^l}ℒ

Feedforward Networks CS618 - Artificial Neural Networks 90
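A matching sketch of this backward-pass summary for element-wise activations (so the Jacobian J_{y^l}(z^l) is diagonal and reduces to f_l′(z^l) applied element-wise). It consumes the ys and zs returned by the forward-pass sketch above; the commented usage lines assume an L2 loss, for which ∇_ŷℒ = ŷ − y.

```python
import numpy as np

def backward(ys, zs, weights, act_derivs, dL_dyhat):
    """Backprop matching the summary; needs the forward-pass intermediates ys, zs."""
    grads_W, grads_b = [], []
    dL_dy = dL_dyhat                               # gradient w.r.t. the network output y^L
    for l in reversed(range(len(weights))):
        dL_dz = dL_dy * act_derivs[l](zs[l])       # elementwise f_l': diagonal J_{y^l}(z^l)
        grads_W.insert(0, np.outer(dL_dz, ys[l]))  # dL/dW^l[j, i] = (dL/dz_j^l) * y_i^{l-1}
        grads_b.insert(0, dL_dz)                   # dL/db^l = dL/dz^l
        dL_dy = weights[l].T @ dL_dz               # dL/dy^{l-1} = (W^l)^T dL/dz^l
    return grads_W, grads_b

# Continuing the forward-pass sketch above, with L2 loss l = 1/2 ||y_hat - y||^2:
# act_derivs = [lambda z: 1 - np.tanh(z)**2, lambda z: 1 - np.tanh(z)**2, lambda z: np.ones_like(z)]
# grads_W, grads_b = backward(ys, zs, weights, act_derivs, dL_dyhat=ys[-1] - y_target)
```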

Summary
• Feedforward Networks
  • Architecture
  • Activation Functions
• Training Feedforward Networks
  • Gradient Descent
  • Automatic Differentiation
  • Scalar and Vector Mode
  • Backpropagation algorithm

Feedforward Networks CS618 - Artificial Neural Networks 91

Next – Issues in Training
• Convergence
• Training Stability
• Heuristics for training

Feedforward Networks CS618 - Artificial Neural Networks 92