
Neural networks

So far, a lot of our ML models have placed strict limits on the kinds of functions we can learn: e.g., linear functions or small trees. This may be OK if we have either a simple problem (e.g., if we can manually design a representation that makes the best classifier be linear) or not much data (so that we can't resolve the difference between different classes of functions).

But what if we know that our problem is really complicated, so that a simple hypothesis isn't going to get maximum performance? What if we don't have enough insight to design a good representation by hand? And what if we have data to burn?

Enter neural networks. A neural network is a flexible representation of a learnable function. Neural networks can be part of a wide variety of ML models: e.g., they can represent decision boundaries in classifiers, or likelihood functions in probability models. They are extremely flexible: in principle they can represent any function to arbitrary accuracy. They are surprisingly effective: despite their complexity, it is practical to learn them from data. But they are expensive: we can't hope to learn a network that contains millions of bits of information unless our training data gives us millions of bits of relevant information. That usually means we need tons of training examples — and likely also tons of computing power to process those examples.

For concreteness, let's think about neural networks for a multidimensional regression problem: we have a bunch of training data points $(x_i, y_i) \in (\mathbb{R}^n, \mathbb{R}^m)$ for $i = 1 \ldots N$. We use our neural network to represent a function $f(x): \mathbb{R}^n \to \mathbb{R}^m$, and we'd like the squared error $\|f(x_i) - y_i\|^2$ to be as small as possible. Keep in mind, though, that regression is only one example of a model that we can build from a neural network.

Graph structure

A neural network is an annotated graph that represents a learnable function:

Each node represents a number, called its activation or value. At the roots of the graph (on the left in the picture) are inputs. At the leaves (on the right) are outputs. Everything else is a latent or hidden value.

Each node also represents a computation. A node's value is the result of some operations that we run locally, using information available at that node and adjacent edges. As we get deeper in the graph, each computation builds on previous ones, so that we can represent more and more complex functions.

Edges represent communication. Each edge is directed, and values travel along it in the given direction.

To support our computations, and to allow learning, nodes and edges can have adjustable parameters or weights: each node's computation can depend on parameters that are stored at that node or its adjacent edges. Later on, when we talk about learning, we'll set these parameters based on training data: for example, we might look at one of the training examples in our regression problem, and update our parameters to reduce the squared error on that example.

A given network can have a lot of inputs, outputs, and parameters. To keep notation manageable, we can concatenate all of the network's inputs into a vector $x \in \mathbb{R}^n$, all of the outputs into a vector $y \in \mathbb{R}^m$, and all of the parameters into a vector $\theta \in \mathbb{R}^d$. We give a name, say $f$, to the network. We then write

$y = f(x; \theta)$

for the function computed by the network: the function value depends on both the inputs and the parameters, and we use a semicolon to separate and keep track of the two different kinds of arguments.

Neurons

Neural networks are built out of simple, reusable building blocks. Big networks just use more of each kind of block, and wire the blocks together with more edges.

The simplest building block is called a node or a neuron:

For this node, the inputs are $x_1, x_2, x_3$. The edges hold parameters $w_1, w_2, w_3$. The node itself holds the bias parameter $w_0$.

The icon on the node tells us that it represents the function

$\max(0,\, w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3)$

which is the result of a two-stage computation:

First we calculate a linear function

$u = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3$

using the node's inputs, weights, and bias. The number $u$ is often called the node's pre-activation.

Second we apply the scalar function

$z = \mathrm{relu}(u) = \max(0, u)$

to get the node's final value $z$. This function is called the node's activation function, link function, transfer function, or nonlinearity. (Relu is short for rectified linear unit: we compute a linear function and then rectify it, meaning make it positive.)

For example, suppose the inputs and weights are $(x_1, x_2, x_3) = (1, 0, 2)$ and $(w_0, w_1, w_2, w_3) = (0, 3, 1, -2)$.

What is the value of the output node?

A: -1

B: 0

C: 1

D: 2

Relu is one of the most common activation functions; another one is the logistic sigmoid

$\sigma(u) = \dfrac{1}{1 + e^{-u}}$

which should look familiar from logistic regression. But the sky is the limit: we could use trig functions, polynomials, splines, or whatever else we can imagine.
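For readers who like to see this in code, here is a minimal numpy sketch of the two-stage neuron computation above; the particular inputs, weights, and bias are made up for illustration.

```python
import numpy as np

def relu(u):
    # rectified linear unit: keep the positive part, zero out the rest
    return np.maximum(0.0, u)

def sigmoid(u):
    # logistic sigmoid, an alternative activation function
    return 1.0 / (1.0 + np.exp(-u))

# made-up inputs, weights, and bias for a single neuron
x = np.array([2.0, 1.0, 4.0])        # inputs x1, x2, x3
w = np.array([1.0, -2.0, 0.5])       # weights w1, w2, w3
w0 = 0.5                             # bias w0

u = w0 + w @ x                       # pre-activation: w0 + w1*x1 + w2*x2 + w3*x3
print(u, relu(u), sigmoid(u))        # 2.5, 2.5, ~0.92
```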

Feedforward networks

A feedforward network is one where the graph is acyclic (a DAG). Right now we will consider only feedforward networks; later on we'll add semantics for cycles (resulting in recurrent networks).

In a feedforward network, we can compute the output values in a single forward pass: we start at the input nodes, and recursively compute each node once its immediate parents are all finished. We finally read off the values of the output nodes.

For example:

What is the value of the topmost output node?

A: -1

B: 0

C: 1

D: 2

Layers

Big networks can get complex and unwieldy to design and use. To help manage this complexity, we often add some structure: we build a big network out of smaller subnetworks. One of the most common kinds of structure is a layer.

A layer is a group of nodes which all get their inputs from the same places and all send their outputs to the same places. Here is a network with layers (outlined in fluorescent green):

We can number the layers based on depth: the input layer is layer zero, and it connects to layer 1, then layer 2, etc. The output layer is layer 4 of this network. Since the input layer doesn't do any computation, we'll call this a four-layer network. The number of nodes in a layer is its width; for example, the width of layer 1 in this network is 3 nodes. The layers between the input and the output are called hidden.

Typically, as a network gets deeper, it gets more expressive — that is, it can represent a wider variety of functions with a given number of nodes. But, we have to take a bit more care when training it — more on this below.

We can think about all of the nodes in a particular layer, and write their activations all together as a vector: if there are $n_j$ nodes in layer $j$, we get a vector $z_j \in \mathbb{R}^{n_j}$.

We can also think about all the edges between an adjacent pair of layers, and write the corresponding weights all together as a matrix. For layers $j-1$ and $j$, we'll call the matrix $W_j$; its size will be $n_j \times n_{j-1}$.

Optionally we can also have bias weights for layer $j$. If so, we can write them as a vector $b_j \in \mathbb{R}^{n_j}$.

This vector-matrix representation is really convenient since it gives us a nice shorthand for the network's computation: we can write

$z_0 = x \qquad z_j = f_j(W_j z_{j-1} + b_j), \quad j = 1 \ldots L$

Here $u_j = W_j z_{j-1} + b_j$ is the vector of pre-activations for all the nodes in layer $j$. The function $f_j$ is the activation function (e.g., relu); by convention we apply it componentwise to the vector of pre-activations.
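As a rough illustration of this shorthand, here is a minimal numpy sketch of the forward pass $z_j = f_j(W_j z_{j-1} + b_j)$; the layer sizes and random parameters are arbitrary stand-ins.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def forward(x, Ws, bs, f=relu):
    """Forward pass: z_0 = x, then z_j = f(W_j z_{j-1} + b_j) for j = 1..L.
    Ws[j] has shape (n_j, n_{j-1}); bs[j] has shape (n_j,)."""
    z = x
    for W, b in zip(Ws, bs):
        u = W @ z + b            # pre-activation u_j
        z = f(u)                 # activation z_j
    return z                     # z_L, the network's output

# a tiny 3 -> 4 -> 2 network with random parameters
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [np.zeros(4), np.zeros(2)]
print(forward(rng.normal(size=3), Ws, bs))
```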

A useful variant of a layer is to add a residual connection: if $n_j = n_{j-1}$, we set

$z_j = z_{j-1} + f_j(W_j z_{j-1} + b_j)$

and choose an activation function such that $f_j(0) = 0$. This way a layer does nothing (implements the identity function) by default, and we only move away from that default if the training data provides evidence that something else is better.

More building blocks

Above we showed just one kind of building block. But we can turn basically any computation into a building block by labeling its inputs, outputs, and parameters.

One common type of building block is normalization: we'd like the nodes in a layer to sum to 1, or to correspond to a unit vector, or satisfy some other constraint on their scale. So, we want to apply a function that enforces this constraint.

An example is softmax, which maps a vector $u \in \mathbb{R}^n$ to another vector $z \in \mathbb{R}^n$ that sums to 1:

$z_k = \dfrac{e^{u_k}}{\sum_{k'} e^{u_{k'}}}$

Softmax is strongly related to the logistic sigmoid function: in fact, softmax on a two-node layer is exactly the logistic sigmoid.

Another example is layer normalization, which makes a layer's vector of activations have zero mean and unit variance. Layer normalization computes

$z_k = \dfrac{u_k - \mu}{\sigma}$

where

$\mu = \dfrac{1}{n} \sum_k u_k \qquad \sigma^2 = \dfrac{1}{n} \sum_k (u_k - \mu)^2$

We could think of a normalization computation as a parameter-less block that we can insert anywhere into our network. But, since it has no parameters of its own, it's more common to combine the normalization with other blocks. For example, a softmax layer is a linear function followed by a softmax normalization:

$z_j = \mathrm{softmax}(W_j z_{j-1} + b_j)$

Or, a relu layer with layer normalization is

$z_j = \mathrm{layernorm}(\mathrm{relu}(W_j z_{j-1} + b_j))$
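Here is a minimal numpy sketch of these two normalization blocks; the small constant added to the denominator in layernorm is an extra numerical-stability guard, not part of the definition above.

```python
import numpy as np

def softmax(u):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(u - np.max(u))
    return e / e.sum()                    # z_k = exp(u_k) / sum_k' exp(u_k')

def layernorm(u, eps=1e-6):
    mu = u.mean()                         # mean of the pre-activations
    sigma = u.std()                       # their standard deviation
    return (u - mu) / (sigma + eps)       # zero mean, unit variance (up to eps)

u = np.array([2.0, 3.0, -1.0])
print(softmax(u))                         # entries sum to 1
print(layernorm(u))                       # approximately zero mean, unit variance
```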

Another important building block is attention, which we'll see later.

Tensors

Sometimes there is structure within the nodes of a single layer. For example, it could be that our vector of activations represents an image. If it's a grayscale image, we could rearrange the vector of activations into a rectangular matrix, with width and height equal to the image width and height. For a color image we might additionally want to stack three matrices: one each for red, green, and blue pixel intensities.

We can keep going: for example, maybe several color images come from a common source, and we stack those together. As we go on, we produce arrays with more and more indices; these arrays are called tensors. The index values in the last dimension are sometimes called channels, by analogy to color channels in an image.

When we rearrange layers into tensors, our weight matrices also have to become tensors. Consider a one-layer network with an input image $x$ and an output image $y$. We can arrange $x$ and $y$ into two-index tensors (i.e., matrices). It's common to make the indexing explicit for clarity: we can write $x_{k\ell}$ and $y_{pq}$.

For example, $x$ might look like:

Our weights then become a 4-index tensor, with one pair of indices for $x$ and another pair for $y$: $W^{k\ell}_{pq}$.

For example, $W$ might look like:

With this notation, we can write our layer's computation as

$y_{pq} = \sigma\!\left(\sum_k \sum_\ell W^{k\ell}_{pq}\, x_{k\ell}\right)$

meaning that we can compute each entry of $y$ as a linear function of all the entries of $x$, followed by the activation. As always, the activation function $\sigma$ is applied componentwise.

More compactly, we can take advantage of the Einstein summation convention (cf. numpy's einsum) and write

$y_{pq} = \sigma(W^{k\ell}_{pq}\, x_{k\ell})$

In this convention, the summations are implicit: any index that appears both as a subscript and a superscript gets summed over, or contracted.
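A small numpy sketch of this tensor layer, using einsum as mentioned above; the image sizes are made up, and relu stands in for the activation $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W_in = 5, 4                       # input image height and width
P, Q = 3, 3                          # output image height and width

x = rng.normal(size=(H, W_in))       # input image x_{kl}
W = rng.normal(size=(H, W_in, P, Q)) # 4-index weight tensor W_{klpq}

# y_{pq} = sigma( sum_k sum_l W_{klpq} x_{kl} ), with relu standing in for sigma
y = np.maximum(0.0, np.einsum('klpq,kl->pq', W, x))
print(y.shape)                       # (3, 3)
```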

Hyperparameters

You may have gotten the impression by now that deep neural nets can be complicated. In fact, they're often so complicated that human designers don't really completely understand the effects of all of the design decisions. For example,

How many layers should we use?

How wide should each layer be?

What pattern of connectivity for the layers?

Should we use residual connections?

What learning rate, or learning rate schedule?

Should we use momentum? How much?

Should we try to precondition, and if so, how?

Should we use a method like Adam that tries to compensate automatically for poor conditioning?

We'll encounter even more potential design decisions later on: e.g., adversarial objectives, new modules like convolutions, etc.

For this reason, it is common to leave some design decisions unspecified, and instead search among these decisions for the best possible performance. The unspecified decisions are called hyperparameters. They can be discrete (number of units in layer 7) or continuous (momentum parameter).

There are lots of sophisticated ways to try to set hyperparameters. But, a simple way that often works well enough is random search: we pick some vectors of hyperparameter settings uniformly at random within a predefined set, train a network for each vector of settings, and compare the resulting networks on a holdout set.
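A minimal sketch of random search; the particular hyperparameter ranges are made up, and train_and_evaluate is a stand-in for whatever training-plus-holdout-evaluation procedure we actually use.

```python
import random

random.seed(0)

# a predefined set of settings to sample from
depth_choices = [2, 3, 4, 5]
width_choices = [32, 64, 128, 256]

def sample_hyperparameters():
    return {
        "num_layers": random.choice(depth_choices),
        "layer_width": random.choice(width_choices),
        "learning_rate": 10 ** random.uniform(-4, -1),   # log-uniform over [1e-4, 1e-1]
        "momentum": random.uniform(0.0, 0.99),
    }

def train_and_evaluate(hp):
    # stand-in: train a network with these settings and return its holdout error;
    # here we fake a score so the loop runs end to end
    return random.random()

best_score, best_hp = float("inf"), None
for _ in range(20):                  # 20 random trials
    hp = sample_hyperparameters()
    score = train_and_evaluate(hp)
    if score < best_score:
        best_score, best_hp = score, hp
print(best_hp)
```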

Learning

Learning a neural network means deciding on the values of all of the parameters $\theta$. To do so, we can use any of the learning principles we've covered so far: maximum likelihood or conditional likelihood, regularized maximum likelihood, information measures, and so forth.

Typically this leads to an optimization problem based on a training set: for example, in our regression problem, given training data $(x_i, y_i)$, maximum conditional likelihood tells us to solve

$\min_\theta \dfrac{1}{N} \sum_{i=1}^{N} \|f(x_i; \theta) - y_i\|^2$

This objective has exactly the form we were discussing in the last lecture, a sum over training examples. In fact, it's a perfect problem for stochastic gradient descent: we'll often have a big training set (so we don't want to look at all of it on every optimization step) and a lot of parameters (so we can't afford to do anything that takes more than linear time in the number of parameters). Furthermore, the exploration and generalization properties of SGD turn out to complement neural networks really well.
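As a sketch of that inner loop, here is SGD on a toy version of the squared-error objective; the gradient routine below uses a simple linear model as a stand-in, since for a real network the gradient comes from backpropagation (next section).

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 3-dimensional regression data (stand-ins for the real training set)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=100)

def grad_single_example(theta, x, y):
    # gradient of ||f(x; theta) - y||^2 for the stand-in linear model
    # f(x; theta) = theta @ x; a real network would get this from backprop
    return 2.0 * (theta @ x - y) * x

theta = np.zeros(3)          # parameters to learn
lr = 0.01                    # learning rate
for step in range(2000):
    i = rng.integers(len(X))                          # look at one example per step
    theta -= lr * grad_single_example(theta, X[i], Y[i])
print(theta)                 # roughly recovers true_w
```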

There are just two problems that we have to solve before we can use SGD to learn our parameters: initializing the network, and computing its gradients.

Backpropagation

The inner loop of SGD computes lots and lots of gradients for single examples, like

$\dfrac{d}{d\theta} \|f(x_i; \theta) - y_i\|^2$

The secret to computing these derivatives is just the chain rule. But we need to use the chain rule a lot of times to differentiate a complicated network function $f(x; \theta)$. Naively, it might seem difficult to keep track.

Fortunately, it turns out that there's a simple and effective way to organize and keep track of millions or billions of applications of the chain rule: the backpropagation or backprop algorithm. We save the results of our forward pass, so that we know the activations and pre-activations at every node. We then run a backward pass, in reverse order from the forward pass. We start at the output of our network by computing the derivative of the loss for a single term in our objective. As we sweep backward, at each new node or edge, we compute local derivatives based on what we already know for neighboring nodes and edges.

In this image, green numbers are the network weights. Brown numbers are the forward pass, starting from $x = 1$ and ending in a prediction of $\hat{y} = 0.381$. Magenta numbers are the backward pass, starting from the derivative of the error $(\hat{y} - 1)^2 / 2$ at $\hat{y} = 0.381$, which is $\hat{y} - 1 = -0.619$.

For example, at bottom left, we see that the derivative of the network output with respect to the bottom-left edge weight is $-0.619$; after we multiply by the edge weight of $0.25$, this edge contributes $-0.619 \cdot 0.25$ to the derivative for the input node.

Vectorized backprop

If our network is bigger than the tiny example above, our code will be much more efficient if we vectorize it. That is, we want to write it in terms of vectors, matrices, and linear algebra operations, so that we can take advantage of highly-optimized linear algebra routines — possibly on data-parallel processors like GPUs.

To this end, suppose we have a feedforward network with $L$ layers. Write $\theta = (W_j, b_j)_{j=1}^{L}$ for the entire list of parameters. Suppose we have just made a prediction for the $i$th training example, by running a forward pass through the network:

$z_0 = x_i \qquad z_j = f_j(W_j z_{j-1} + b_j) \qquad f(x_i; \theta) = z_L$

Now we want the gradient for the $i$th term of our objective function

$\dfrac{d\ell_i}{d\theta} \quad \text{where} \quad \ell_i = \|y_i - z_L\|^2$

That is, we want all of the gradient pieces $\dfrac{d\ell_i}{dW_1}$, $\dfrac{d\ell_i}{db_1}$, and so forth.

We can compute all of these gradient pieces with a single backward pass through the network. We start with the gradient of the loss function:

$g_L = \dfrac{d\ell_i}{dz_L} = 2(z_L - y_i)^T$

This vector $g_L$ is the input to backpropagation.

Notation: in this part of the notes, we'll use the convention that the shape of a vector or matrix derivative is (outputs $\times$ inputs). So, since $g_L$ is the derivative of a scalar with respect to a vector of length $m$, its shape is $1 \times m$. This is actually the transpose of the usual convention, but it makes the notation a lot simpler in the math below. In particular, it makes the multidimensional chain rule really simple: $\dfrac{dp}{dq} = \dfrac{dp}{du} \dfrac{du}{dq}$, just like the scalar case, even when $p, q, u$ are vectors. When we use the derivatives in SGD, we can just transpose them back to the usual shape.

When we process layer $j$, we will have computed $g_j$, the gradient of the loss with respect to this layer's activations $z_j$. We will then use the chain rule a bunch of times: first, we will compute all of the gradient pieces for the local parameters at this layer. Then, we will compute a new vector $g_{j-1}$ that continues backpropagating to the previous layer.

In more detail, suppose we are currently at layer $j$, and we have just computed $g_j$. We can use the chain rule to find:

$\dfrac{d\ell_i}{db_j} = \dfrac{d\ell_i}{dz_j} \dfrac{dz_j}{db_j} = g_j \dfrac{dz_j}{db_j}$

$\dfrac{d\ell_i}{dW_j} = \dfrac{d\ell_i}{dz_j} \dfrac{dz_j}{dW_j} = g_j \dfrac{dz_j}{dW_j}$

$g_{j-1} = \dfrac{d\ell_i}{dz_{j-1}} = \dfrac{d\ell_i}{dz_j} \dfrac{dz_j}{dz_{j-1}} = g_j \dfrac{dz_j}{dz_{j-1}}$


The first two equations are the local gradient pieces; the last is the vector that we keep propagating backward. All of these expressions use only $g_j$ (which we already have) and the local derivatives

$\dfrac{dz_j}{db_j} \qquad \dfrac{dz_j}{dW_j} \qquad \dfrac{dz_j}{dz_{j-1}}$

To get the local derivatives, write

$z_j = f_j(u_j) \qquad u_j = W_j z_{j-1} + b_j$

so that

$\dfrac{dz_j}{db_j} = \dfrac{dz_j}{du_j} \dfrac{du_j}{db_j} \qquad \dfrac{dz_j}{dW_j} = \dfrac{dz_j}{du_j} \dfrac{du_j}{dW_j} \qquad \dfrac{dz_j}{dz_{j-1}} = \dfrac{dz_j}{du_j} \dfrac{du_j}{dz_{j-1}}$

The first factor is the same for all three local derivatives: $\dfrac{dz_j}{du_j}$ is the derivative of the activation function. Since this is a componentwise function (each component of the output depends only on the corresponding component of the input), the derivative will be a diagonal matrix — call it $D_j$. For example, if

$u_j = \begin{pmatrix} 2 \\ 3 \\ -1 \end{pmatrix}$

and if we are using relu activations, then we get that the derivative is

$D_j = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}$

since the derivative of relu is either 1 or 0, depending on whether its input is positive or negative.

Aside: an activation function like softmax isn't componentwise. But its derivative is still simple to calculate. There's nothing special about a diagonal $D_j$, aside from a slight decrease in computational cost.

The second factor in each of these expressions is the derivative of a linear function: two of them are easy,

$\dfrac{du_j}{db_j} = I \qquad \dfrac{du_j}{dz_{j-1}} = W_j$

The last expression, $\dfrac{du_j}{dW_j}$, is a bit tricky: since it's the derivative of a vector with respect to a matrix, it's technically a three-index tensor (two indices for the input and one for the output). The simplest way to deal with this tensor is to avoid it: we can use the linearity of derivatives together with our above calculations to compute the entire gradient we're interested in at once,

$\dfrac{d\ell_i}{dW_j} = g_j \dfrac{dz_j}{dW_j} = g_j D_j \dfrac{du_j}{dW_j} = \dfrac{d}{dW_j}\, g_j D_j u_j$

Now we can substitute in the definition of $u_j$ and take the derivative:

$\dfrac{d}{dW_j}\, g_j D_j (W_j z_{j-1} + b_j) = z_{j-1}\, g_j D_j$

Since $z_{j-1}$ is a column vector and $g_j$ is a row vector, this is a rank-1 matrix.

Putting it all together, we save all of the activations $z_j$ and pre-activations $u_j$ from our forward pass. We initialize $g_L$ with the derivative of a single loss function term $\ell_i$. Then at each layer $j$, we start from $g_j$, and calculate:

$D_j = f_j'(u_j)$

$\dfrac{d\ell_i}{db_j} = g_j D_j$

$\dfrac{d\ell_i}{dW_j} = z_{j-1}\, g_j D_j$

$\dfrac{d\ell_i}{dz_{j-1}} = g_j D_j W_j$

The final equation above gives us $g_{j-1}$, so we can continue on to the previous layer, sweeping backward over the network and recursively computing all of the derivative pieces we need for SGD.
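Here is a minimal numpy sketch of this backward pass for relu layers with the squared-error loss above; it keeps the gradients as 1-d arrays rather than explicit row vectors, and stores each $d\ell_i/dW_j$ directly in the shape of $W_j$ (i.e., already transposed back for SGD).

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def forward(x, Ws, bs):
    # forward pass, saving activations z_j and pre-activations u_j
    zs, us = [x], []
    for W, b in zip(Ws, bs):
        u = W @ zs[-1] + b
        us.append(u)
        zs.append(relu(u))
    return zs, us

def backward(zs, us, Ws, y):
    """Backward pass for the loss l_i = ||y - z_L||^2 with relu layers."""
    g = 2.0 * (zs[-1] - y)                 # g_L = dl_i/dz_L
    dWs, dbs = [], []
    for j in reversed(range(len(Ws))):
        D = (us[j] > 0).astype(float)      # diagonal of D_j = f_j'(u_j) for relu
        gD = g * D                         # the row vector g_j D_j
        dbs.insert(0, gD)                  # dl_i/db_j = g_j D_j
        # dl_i/dW_j = z_{j-1} g_j D_j, a rank-1 matrix, stored here in the shape of W_j
        dWs.insert(0, np.outer(gD, zs[j]))
        g = gD @ Ws[j]                     # g_{j-1} = g_j D_j W_j
    return dWs, dbs

# quick check on a tiny random two-layer network
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
x, y = rng.normal(size=3), rng.normal(size=2)
zs, us = forward(x, Ws, bs)
dWs, dbs = backward(zs, us, Ws, y)
print([dW.shape for dW in dWs])            # [(4, 3), (2, 4)], matching the W_j
```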

Pretraining

Now that we've looked at computing derivatives, the next question is how to initialize SGD. The easiest way to initialize is to get someone else to do it for us! If our inputs represent a common data type like text or images, there are a number of pre-trained networks available to download. They won't have been trained on exactly our task, but they might have been trained on something close enough to be useful — and often with a lot of data.

Given a pre-trained network, there are two good ways to use it. The first is to fine-tune it: we run some iterations of SGD on it with our own loss function, often with a fairly conservative learning rate.

The second is to train only the last layer: we fix all of the weights in layers $1 \ldots L-1$, and train only $W_L, b_L$. This results in a linear learning problem: the first $L-1$ layers are effectively a complicated way to compute features $z_{L-1}$ of our input data. So for example, if our objective function is minimum squared error, we wind up with linear regression.

The key advantage of the last layer method is efficiency, both statistical and computational. The last layer activations don't change, and so we only have to compute them once for each training example. After that we get to completely forget about most of the network, which can be a big savings, especially on memory-limited devices. If our objective is convex, we can use fast convex solvers to find the last layer weights — maybe as simple as solving a set of linear equations. And, if our training data is limited, we are learning a much less complex function, and so we might be able to generalize better.
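A minimal sketch of the last-layer recipe under a squared-error objective: compute the frozen features once, then solve an ordinary least-squares problem. The pretrained weights and data here are random stand-ins.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def features(x, Ws, bs):
    # run the frozen layers 1 .. L-1 to get the feature vector z_{L-1}
    z = x
    for W, b in zip(Ws, bs):
        z = relu(W @ z + b)
    return z

rng = np.random.default_rng(0)
# random stand-ins for pretrained weights (layers 1 .. L-1) and a small labeled set
Ws = [rng.normal(size=(16, 8)), rng.normal(size=(16, 16))]
bs = [np.zeros(16), np.zeros(16)]
X = rng.normal(size=(200, 8))
Y = rng.normal(size=(200, 3))

Z = np.stack([features(x, Ws, bs) for x in X])   # features, computed once per example
# last-layer weights from ordinary least squares (a convex problem)
W_last, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(W_last.shape)                              # (16, 3): features -> outputs
```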

If we can't convince someone else to pretrain for us, we can also pretrain a network ourselves. This is often a good idea when we have limited data for the problem we care about, but a lot of data for a related problem.

Of course, before we can pretrain, we have to figure out how to initialize a network on our own.

Initialization

If we initialize the parameters of our network badly, we can make our lives very difficult. In a poorly initialized network, the starting function might oscillate rapidly between $\pm 10^{12}$, or be essentially zero, or some combination of both. Many of the parameters might not influence the learned function at all, while for other parameters, a tiny tweak might change the learned function so much as to become almost unrecognizable. If we put a poorly initialized network into an optimization problem, the conditioning can be horrible: the second derivative in some directions could be near zero while in others it is a numerical overflow.

On the other hand, a well-initialized network can solve a lot of our problems before even a single step of training. There are even some uses of neural networks that stop after initialization and don't bother with training at all. Maybe more useful, we can initialize the network and then train only the last layer weights; this strategy can be surprisingly effective.

Most common initialization methods use randomly-generated weights. The question, of course, is what distribution to pick from. To understand this choice, the key piece of information we need is the recursive expression we derived above for backpropagation:

$g_{j-1} = g_j D_j W_j$

That is, to backpropagate a gradient, we multiply it by a diagonal matrix of activation derivatives and then by a weight matrix.

As a thought experiment, what happens if the norm of $D_j W_j$ is large? Then $g_{j-1}$ will tend to be bigger than $g_j$. If this happens on every layer, we could have a problem: if the norm of the gradient doubles on every layer, and there are 50 layers, then our gradient norm might be $10^{15}$ at the first layer. This is called exploding gradients.

On the other hand, what happens if the norm of $D_j W_j$ is small? We get the opposite problem, called vanishing gradients: the gradient norm might be $10^{-15}$ at the first layer.

Even worse, we might get a mix of these two problems: the gradients might blow up in some directions, and simultaneously collapse in others. The deeper our network, the worse these problems can be: gradients will grow or shrink exponentially fast in the depth of the network. For this reason, when neural networks were first introduced, many experts thought that deep networks were essentially impossible to train. A small numeric illustration of this effect is sketched below.
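Here is the thought experiment in numpy: multiplying a unit-norm vector through 50 random layer matrices whose per-layer gain is slightly above, at, or below 1. (The activation-derivative factor $D_j$ is omitted for simplicity.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 100, 50

for scale in [1.1, 1.0, 0.9]:            # per-layer gain: a bit too big, about right, a bit too small
    g = np.ones(n) / np.sqrt(n)          # unit-norm stand-in for the output gradient
    for _ in range(depth):
        W = scale * rng.normal(size=(n, n)) / np.sqrt(n)   # random layer matrix
        g = g @ W                        # one step of g_{j-1} = g_j D_j W_j (with D_j dropped)
    print(scale, np.linalg.norm(g))      # norm grows or shrinks exponentially with depth
```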

What precisely makes exploding/vanishing gradients a problem?

To avoid these problems, we want to initialize so that the gradients have about the same magnitude at every layer. For this to happen, the matrix $D_j W_j$ should leave the norm of $g_j$ approximately unchanged — that is, $D_j W_j$ should be approximately orthonormal.

Let's look at the effect of $D_j$ first. Its scaling will depend on the slope of our activation function: for example, the sigmoid activation function has slope $\frac{1}{4}$ near zero, and the relu activation function has average slope $\frac{1}{2}$ near zero. To compensate for the scaling from $D_j$, we can adjust $W_j$: for example, with sigmoid activations, we'd like $W_j$ to be approximately 4 times bigger.

At this point, all that's left is to initialize $W_j$. Perhaps the most common choice is to set every element of $W_j$ to be an independent Gaussian random variable: if we set the variance of each entry to $\frac{1}{n_{j-1}}$, then the largest singular value of $W_j$ will be about 1. We can then scale to compensate for the slope of our activation function.

This independent Gaussian initialization is much better than nothing, but it actually isn't great: while it gets the norm (the largest singular value) of $W_j$ right, unfortunately $W_j$ may be far from orthonormal (the smallest singular value may be too small). That means that there will be some directions in parameter space that barely affect the output of our network, leading to poor conditioning.

For moderate-sized matrices, it's actually not hard to generate random weights that are exactly orthonormal. (See scipy.linalg.orth and numpy.random.randn.) For huge matrices this may be costly; but even here there are shortcuts. A good one is the so-called fastfood method.
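A minimal sketch of the orthonormal option using scipy.linalg.orth, as suggested above; this version assumes a square weight matrix, and a rectangular layer would need the analogous adjustment.

```python
import numpy as np
from scipy.linalg import orth

rng = np.random.default_rng(0)
n = 64

A = rng.normal(size=(n, n))      # independent Gaussian entries (cf. numpy.random.randn)
W = orth(A)                      # orthonormal basis for the range of A

s = np.linalg.svd(W, compute_uv=False)
print(s.min(), s.max())          # every singular value is 1: no direction is stretched or squashed
```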

There are two last bits of bookkeeping. First, we have to initialize the bias weights $b_j$. It's typically OK just to set these to zero. Second, we might want to scale the first layer weights, or add a first layer bias term, to compensate for any badly shifted or scaled input coordinates. This serves two purposes: first, it keeps big inputs from overwhelming smaller ones. Second, for activations like sigmoid, it prevents saturation: i.e., it keeps the inputs in the range where the sigmoid is approximately linear.

Aside: why now?

Many of the ingredients in this lecture — SGD, deep networks, backprop, etc. — have been known for decades. So why didn't neural networks take off so spectacularly until the last 10 years or so?

The typical answer is that we needed to wait for two factors to be in place:

big data: it's only been recently that internet companies could put together huge training sets at a reasonable cost

compute: we had to wait for modern GPUs

But these aren't the whole story: today we can train a network that generalizes well from a small data set and doesn't take a ridiculous amount of compute power. So, here are a few more factors that could also have contributed:

business model: we now have companies that depend much more strongly on technologies like personalization, where more flexible models and better predictions translate directly to higher profits

intuition: current wisdom on how to train neural networks sounds really counterintuitive based on what we knew even 10 years ago, and it took us all a while to overcome our preconceptions and figure out what is possible

initialization: we didn't figure out successful ways to initialize anything more than tiny networks until about 10 years ago, and badly-initialized networks are essentially useless

Expressiveness

We said above that neural networks can represent essentially any function. Let's make that precise:

For every measurable function $f: \mathbb{R}^n \to \mathbb{R}^m$ and every $\epsilon > 0$, there exists another function $f_\epsilon: \mathbb{R}^n \to \mathbb{R}^m$ such that

$|f - f_\epsilon| \le \epsilon$ (meaning that they differ by no more than $\epsilon$ on any input)

and $f_\epsilon$ is the output of a neural network built only from relu nodes.

This is called the universal approximation theorem. The intuition for the proof is actually pretty simple. We can define a local "bump" function easily with a small network:

We can make the plateau be any height we want, and we can make the ramps up and down as steep as necessary.

We can make the shape of the bump approximate any convex set: a convex set is equal to the intersection of a bunch of halfspaces. We can represent each halfspace easily, as well as count how many halfspaces contain the input point.

Then we can threshold the count with a relu like $\max(0,\, h_1 + h_2 + h_3 - 2.9)$.
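To make the bump construction concrete, here is a small 1-d sketch: four relu nodes feeding one linear output node produce a plateau of chosen height with ramps of chosen steepness. The interval endpoints, ramp width, and height are arbitrary.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)

def bump(x, a=1.0, b=3.0, delta=0.1, height=2.0):
    # plateau of the given height on [a + delta, b], ramps of width delta;
    # four relu nodes feeding one linear output node
    s = height / delta           # ramp steepness
    return (s * relu(x - a) - s * relu(x - a - delta)
            - s * relu(x - b) + s * relu(x - b - delta))

xs = np.array([0.5, 1.05, 2.0, 3.05, 4.0])
print(bump(xs))                  # ~[0, 1, 2, 1, 0]: zero outside, `height` on the plateau
```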

This means we can approximate any piecewise-constant function arbitrarily well: we cover each piece with convex sets, and set the output value in each set independently using a bump function.

But now we can handle all measurable functions. Measurable functions are precisely the ones that we can represent as the limit of a sequence of piecewise-constant functions — so if we go farther and farther along the sequence, and approximate each new piecewise-constant function better and better, we can represent our target function as accurately as desired.

Note that the resulting network might be huge, and it might be impossible to learn. But it exists.

A similar theorem is true for many other types of networks: e.g., we can replace relu with sigmoid activations, or we can limit the depth of the network. Even a single hidden layer is enough for universal approximation; in fact, it's enough even if we have to set the input-to-hidden weights randomly instead of training them. (That last result is called random kitchen sinks.)

The theorem above doesn't place any bounds on the size of the network, but there are more refined versions that do, using a variety of different assumptions. Loosely, the deeper a network is, the more expressive: we need fewer total nodes to approximate a typical function.

There are even theorems that guarantee learnability via SGD, given a big enough network and a big enough training set. E.g., one recent result (the neural tangent kernel) shows learnability if we assume that the target function is close to an element of a particular reproducing kernel Hilbert space. We'll cover RKHSs later in the course, but this is a really rich class of functions. For example, this result guarantees that we can learn any measurable function, though really complicated functions might require huge networks and ridiculous amounts of data.

All the above is not to say that we completely understand the expressiveness and learnability of deep neural networks — far from it, in fact. We know that we're missing some understanding both qualitatively and quantitatively: that is, we think that our bounds are loose even for the effects we know about, and also that there are some entire classes of effects that we are missing.
