
  • San José State University

    Math 251: Statistical and Machine Learning Classification

    Neural Networks

    Dr. Guangliang Chen

  • Outline of the presentation:

    • Overview

    – What is a neural network

    – What is a neuron

    • Perceptrons

    • Networks of sigmoid neurons

    • Summary

  • Neural Networks

    Acknowledgments

    This presentation is based on the following references:

    • Nielsen’s online book “Neural Networks and Deep Learning”1

    • Olga Veksler’s lecture on neural networks2

    1 http://neuralnetworksanddeeplearning.com
    2 http://www.csd.uwo.ca/courses/CS9840a/Lecture10_NeuralNets.pdf

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 3/60

  • Neural Networks

    What is an artificial neural network?

    [Figure: a feedforward network with an input layer (x1, x2, . . . , xd), hidden layer(s), and an output layer (y1, y2, . . . , yk).]

    The leftmost layer inputs the features xi.

    The rightmost layer outputs the predictions yi.

    The solid circles represent neurons, which process inputs from the preceding layer and output results for the next layer.

    The neural network is called a deep network if it has more than one hidden layer (otherwise, it is said to be shallow).

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 4/60

  • Neural Networks

    ANN for MNIST handwritten digits recognition

    [Figure: a network for MNIST with an input layer of 784 pixels (x1, . . . , xd), hidden layer(s) providing abstraction, and an output layer of 10 classes (the digits 0-9).]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 5/60

  • Neural Networks

    What is a biological neuron?

    • Neurons (or nerve cells) are special cells that process and transmit information by electrical signaling (in the brain and also the spinal cord)

    • The human brain has around 10^11 neurons

    • A neuron connects to other neurons to form a network

    • Each neuron communicates with between 1,000 and 10,000 other neurons

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 6/60

  • Neural Networks

    Components of a biological neuron

    dendrites:

    • "input wires", receive inputs from other neurons

    • a neuron may have thousands of dendrites, usually short

    cell body: computational unit

    axon:

    • "output wire", sends signals to other neurons

    • a single long structure (up to 1 m)

    • splits into possibly thousands of branches at the end

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 7/60

  • Neural Networks

    Artificial neurons are mathematical functions

    [Figure: a single artificial neuron with inputs x1, x2, . . . , xd, weights w1, w2, . . . , wd, bias b, and activation function f, producing the output f(w · x + b).]

    In the above,

    • wi: weights, b: bias, and f : activation function
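
    As a concrete illustration, here is a minimal Python sketch of such a neuron (the function name neuron and the use of NumPy are choices made here, not from the slides): it forms the weighted input w · x + b and passes it through the activation function f.

    import numpy as np

    def neuron(x, w, b, f):
        # weighted input z = w · x + b, then activation f(z)
        z = np.dot(w, x) + b
        return f(z)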

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 8/60

  • Neural Networks

    Two simple activation functions

    • Heaviside step function: H(z) = 1_{z>0}

    • Sigmoid: σ(z) = 1/(1 + e^{−z})

    [Figure: plot of H(z) and σ(z) for z ∈ [−3, 3]; both take values between 0 and 1.]

    The corresponding neurons are called perceptrons and sigmoid neurons.

    We will mention several other activation functions in the end.
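
    As a quick sketch (the function names heaviside and sigmoid are choices made here), the two activation functions can be written as:

    import numpy as np

    def heaviside(z):
        # H(z) = 1 if z > 0, else 0
        return np.where(z > 0, 1.0, 0.0)

    def sigmoid(z):
        # σ(z) = 1 / (1 + e^{-z})
        return 1.0 / (1.0 + np.exp(-z))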

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 9/60

  • Neural Networks

    ANN is a composition of functions!

    • Each neuron is a function

    • It accepts inputs from the previous layer and outputs results for the next layer

    It can be proved that every continuous function can be implemented with one hidden layer (containing enough hidden units) and proper nonlinear activation functions.

    This is more of theoretical than practical interest.

    [Figure: the feedforward network diagram again, with input layer (x1, . . . , xd), hidden layer(s), and output layer (y1, . . . , yk).]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 10/60

  • Neural Networks

    How to train ANNs in principle

    1. Select an activation function for all neurons.

    2. Tune the weights and biases at all neurons to match prediction and truth "as closely as possible":

    • formulate an objective or loss function L

    • optimize it with gradient descent

    – the technique is called backpropagation

    – lots of notation due to the complex form of the gradient

    – lots of tricks to get gradient descent to work reasonably well

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 11/60

  • Neural Networks

    Perceptrons

    A perceptron is a neuron whose activation function is the Heaviside step function. It defines a linear, binary classifier (not necessarily an optimal one).

    [Figure: left, a perceptron computing sgn(w · x + b) from inputs x1, . . . , xd with weights w1, . . . , wd and bias b; right, two classes of points (labeled −1 and 1) separated by the hyperplane w · x + b = 0.]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 12/60

  • Neural Networks

    Derivation of the Perceptron loss function

    • If a point xi is misclassified, then yi(w · xi + b) < 0 (implying that −yi(w · xi + b) > 0, which can be regarded as a penalty/loss).

    [Figure: two classes (labeled −1 and 1) and a hyperplane w · x + b = 0; the misclassified points are marked ⊗.]

    • Denote the set of misclassified points by M.

    • The goal is to minimize the total loss

    ℓ(w, b) = −∑_{i∈M} yi(w · xi + b)

    • If ℓ gets to zero, we know we have the best possible solution (M empty → no training error)

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 13/60

  • Neural Networks

    How to minimize the perceptron loss

    The perceptron loss contains a discrete object (i.e., M) that depends on the variables w, b, making it hard to solve analytically.

    To obtain an approximate solution, use gradient descent:

    • Initialize the weights w and bias b (which would determine M).

    • Iterate until stopping criterion is met (see next page):

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 14/60

  • Neural Networks

    – Given M: The gradient may be computed as follows

    ∂ℓ/∂w = −∑_{i∈M} yi xi,    ∂ℓ/∂b = −∑_{i∈M} yi

    We then update w, b as follows:

    w ←− w + ρ ∑_{i∈M} yi xi

    b ←− b + ρ ∑_{i∈M} yi

    where ρ > 0 is a parameter, called the learning rate.
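
    For concreteness, here is a minimal NumPy sketch of one such full-batch update, assuming the labels yi ∈ {−1, +1} are stored in a vector y and the points xi are the rows of X (the function and variable names are choices made here):

    import numpy as np

    def perceptron_step(X, y, w, b, rho=1.0):
        # misclassified set M: points with y_i (w · x_i + b) < 0
        M = y * (X @ w + b) < 0
        # gradient-descent update with learning rate rho
        w = w + rho * (y[M, None] * X[M]).sum(axis=0)
        b = b + rho * y[M].sum()
        return w, b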

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 15/60

  • Neural Networks

    – Given w, b: update M as the set of new errors:

    M = {1 ≤ i ≤ n | yi(w · xi + b) < 0}

    [Figure: the data before and after one round of gradient descent; the separating hyperplane moves so that fewer points (marked ⊗) are misclassified.]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 16/60

  • Neural Networks

    Stochastic gradient descent

    The gradient descent approach we have used so far assumes that we have access to the full training set, and uses all training data to iteratively update the weights and bias.

    This may be slow for large data sets, or impractical in the setting of online learning (where data comes sequentially).

    A variant of gradient descent, called stochastic gradient descent, uses

    • only a single training point, or

    • a small subset of examples (called mini-batch),

    each round to update weights and bias.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 17/60

  • Neural Networks

    • Single-sample update rule:

    – Start with a random hyperplane (with corresponding w and b)

    – Randomly select a new point xi from the training set: if it lies on the correct side, no change; otherwise update

    w ←− w + ρ yi xi

    b ←− b + ρ yi

    – Repeat until all examples have been visited (this is called an epoch)
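
    A minimal sketch of one such epoch in NumPy (assuming labels yi ∈ {−1, +1} in y and points as rows of X; names are choices made here):

    import numpy as np

    def perceptron_sgd_epoch(X, y, w, b, rho=1.0):
        # visit every training point once, in random order
        for i in np.random.permutation(len(X)):
            if y[i] * (X[i] @ w + b) < 0:   # misclassified
                w = w + rho * y[i] * X[i]
                b = b + rho * y[i]
        return w, b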

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 18/60

  • Neural Networks

    • Mini-batch update rule:

    – Divide the training data into mini-batches (of size 5 or 10), and update the weights after processing each mini-batch:

    w ←− w + ρ ∑ yi xi

    b ←− b + ρ ∑ yi

    where the sums run over the misclassified points in the current mini-batch

    – Middle ground between single sample and full training set

    – One iteration over all mini-batches is called an epoch
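
    A sketch of one epoch of the mini-batch rule, under the same assumptions (labels in {−1, +1}, points as rows of X) and with names chosen here:

    import numpy as np

    def perceptron_minibatch_epoch(X, y, w, b, rho=1.0, batch_size=10):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            B = idx[start:start + batch_size]
            M = B[y[B] * (X[B] @ w + b) < 0]   # misclassified points in this batch
            w = w + rho * (y[M, None] * X[M]).sum(axis=0)
            b = b + rho * y[M].sum()
        return w, b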

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 19/60

  • Neural Networks

    Comments on stochastic gradient descent

    • Single-sample update rule applies to online learning

    • Faster than full gradient descent, but maybe less stable

    • Mini-batch update rule might achieve some balance between speed and stability

    • May find only a local minimum (the hyperplane is trapped in a suboptimal location)

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 20/60

  • Neural Networks

    Some remarks about the Perceptron algorithm

    • If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps, but not necessarily an optimal one.

    • The number of steps can be very large. The smaller the margin (between the classes), the longer it takes to find it.

    • When the data are not separable, the algorithm will not converge, and cycles develop (which can be long and therefore hard to detect).

    • It is thus not a good classifier, but it is conceptually very important (neuron, loss function, gradient descent).

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 21/60

  • Neural Networks

    Multilayer perceptrons (MLP)

    [Figure: the feedforward network diagram again, with input layer, hidden layer(s), and output layer.]

    An MLP is a network of perceptrons.

    However, each perceptron has a discrete behavior, making its effect on later layers hard to predict.

    Next we will look at networks of sigmoid neurons.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 22/60

  • Neural Networks

    Sigmoid neurons

    Sigmoid neurons are smoothed-out (or soft) versions of the perceptrons:

    A small change in any weight or bias causes only a small change in the output.

    We say the neuron is in low (high) activation if the output is near 0 (1).

    When the neuron is in high activation, we say that it fires.

    σ(w · x + b) = 1/(1 + e^{−(w·x+b)})

    [Figure: a sigmoid neuron with inputs x1, . . . , xd, weights w1, . . . , wd, bias b, and output σ(w · x + b).]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 23/60

  • Neural Networks

    The network of sigmoid neurons

    The output of such a network depends continuously on its weights and biases (so everything is more predictable compared to the MLP).

    [Figure: the feedforward network diagram again, with input layer (x1, . . . , xd), hidden layer(s), and output layer (y1, . . . , yk).]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 24/60

  • Neural Networks

    How do we train a neural network?

    • Notation, notation, notation

    • Backpropagation

    • Practical issues and solutions

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 25/60

  • Neural Networks

    Notation

    For each ℓ = 1, . . . , L:

    - w^ℓ_{jk}: layer ℓ, "j back to k" weight;

    - b^ℓ_j: layer ℓ, neuron j bias

    - a^ℓ_j: layer ℓ, neuron j output

    - z^ℓ_j = ∑_k w^ℓ_{jk} a^{ℓ−1}_k + b^ℓ_j: weighted input to neuron j in layer ℓ

    Note that a^ℓ_j = σ(z^ℓ_j).

    [Figure: neuron k in layer ℓ−1 feeds neuron j in layer ℓ through the weight w^ℓ_{jk}; neuron j adds the bias b^ℓ_j to form z^ℓ_j and outputs a^ℓ_j = σ(z^ℓ_j); layer L is the output layer.]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 26/60

  • Neural Networks

    Notation (vector form)

    W^ℓ = (w^ℓ_{jk})_{j,k}: matrix of all weights between layers ℓ−1 and ℓ;

    b^ℓ = (b^ℓ_j)_j: vector of biases in layer ℓ

    z^ℓ = (z^ℓ_j)_j: vector of weighted inputs to neurons in layer ℓ

    a^ℓ = (a^ℓ_j)_j: vector of outputs from neurons in layer ℓ

    We write a^ℓ = σ(z^ℓ) (componentwise).

    [Figure: the same diagram in vector form: a^{ℓ−1} is mapped to z^ℓ and then a^ℓ via (W^ℓ, b^ℓ) and σ, and so on up to the output a^L.]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 27/60

  • Neural Networks

    The feedforward relationship

    First note that

    • The input layer is indexed by ℓ = 0 so that a^0 = x.

    • a^L is the network output.

    For each 1 ≤ ℓ ≤ L,

    z^ℓ = W^ℓ a^{ℓ−1} + b^ℓ

    a^ℓ = σ(z^ℓ)

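    As a minimal sketch (names chosen here), the feedforward relationship can be coded directly:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def feedforward(x, Ws, bs):
        # a^0 = x; a^l = sigmoid(W^l a^{l-1} + b^l) for l = 1, ..., L
        a = x
        for W, b in zip(Ws, bs):
            a = sigmoid(W @ a + b)
        return a   # the network output a^L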

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 28/60

  • Neural Networks

    The network loss

    To tune the weights and biases of a network of sigmoid neurons, we need to select a loss function.

    We first consider the square loss due to its simplicity:

    C({W^ℓ, b^ℓ}_{1≤ℓ≤L}) = (1/(2n)) ∑_{i=1}^n ‖a^L(xi) − yi‖²

    where

    • a^L(xi) is the network output when inputting a training example xi.

    • yi is the training label (coded by a vector).
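
    A sketch of this cost in Python (assuming outputs[i] = a^L(xi) and labels[i] = yi as vectors; names chosen here):

    import numpy as np

    def square_loss(outputs, labels):
        # C = (1 / 2n) * sum_i ||a^L(x_i) - y_i||^2
        n = len(outputs)
        return sum(np.sum((a - y) ** 2) for a, y in zip(outputs, labels)) / (2 * n)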

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 29/60

  • Neural Networks

    Remark. In our setting, the labels are coded as follows:

    digit 0 = (1, 0, . . . , 0)^T,   digit 1 = (0, 1, . . . , 0)^T,   . . . ,   digit 9 = (0, 0, . . . , 1)^T

    Therefore, by varying the weights and biases, we try to minimize the difference between each network output a^L(xi) and one of the vectors above (associated to the training class that xi belongs to).
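
    This "one-hot" coding is easy to produce in code (a sketch; the function name is chosen here):

    import numpy as np

    def one_hot(digit, num_classes=10):
        # e.g. one_hot(0) = [1, 0, ..., 0] and one_hot(9) = [0, ..., 0, 1]
        y = np.zeros(num_classes)
        y[digit] = 1.0
        return y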

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 30/60

  • Neural Networks

    Gradient descent

    The network loss has too many variables to be minimized analytically:

    C({W^ℓ, b^ℓ}_{1≤ℓ≤L}) = (1/(2n)) ∑_{i=1}^n ‖a^L(xi) − yi‖²

    We'll use gradient descent to attack the problem. However, computing all the partial derivatives ∂C/∂w^ℓ_{jk}, ∂C/∂b^ℓ_j is highly nontrivial.

    To simplify the task a bit, we consider a sample of size 1 consisting of only xi:

    Ci({W^ℓ, b^ℓ}_{1≤ℓ≤L}) = (1/2) ‖a^L(xi) − yi‖² = (1/2) ∑_j (a^L_j − yi(j))²

    which is enough as ∂C/∂w^ℓ_{jk} = (1/n) ∑_i ∂Ci/∂w^ℓ_{jk} and ∂C/∂b^ℓ_j = (1/n) ∑_i ∂Ci/∂b^ℓ_j.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 31/60

  • Neural Networks

    The output layer first

    We start by computing ∂Ci/∂w^L_{jk} and ∂Ci/∂b^L_j, as they are the easiest.

    [Figure: neuron k in layer L−1 feeds output neuron j (weight w^L_{jk}, bias b^L_j); the output-layer activations a^L_1, . . . , a^L_j enter the cost Ci.]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 32/60

  • Neural Networks

    Computing ∂Ci/∂w^L_{jk}, ∂Ci/∂b^L_j for the output layer

    By the chain rule we have

    ∂Ci/∂w^L_{jk} = (∂Ci/∂a^L_j) · (∂a^L_j/∂w^L_{jk})

    where ∂Ci/∂a^L_j = a^L_j − yi(j) for the square loss, and

    ∂a^L_j/∂w^L_{jk} = (∂a^L_j/∂z^L_j) · (∂z^L_j/∂w^L_{jk}) = σ′(z^L_j) a^{L−1}_k

    which is obtained by applying the chain rule again with the formula for a^L_j.

    [Figure: the output-layer diagram, with a^L_j = σ(z^L_j) and z^L_j = ∑_{k′} w^L_{jk′} a^{L−1}_{k′} + b^L_j.]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 33/60

  • Neural Networks

    Computing ∂Ci/∂w^L_{jk}, ∂Ci/∂b^L_j for the output layer

    Combining results gives that

    ∂Ci/∂w^L_{jk} = (∂Ci/∂a^L_j) · (∂a^L_j/∂w^L_{jk}) = (a^L_j − yi(j)) σ′(z^L_j) a^{L−1}_k.

    Similarly, we obtain that

    ∂Ci/∂b^L_j = (∂Ci/∂a^L_j) · (∂a^L_j/∂b^L_j) = (a^L_j − yi(j)) σ′(z^L_j).
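
    These two formulas translate directly into code. A sketch for the whole output layer at once, with a vector of "deltas" (names chosen here):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def output_layer_grads(a_prev, zL, aL, y):
        # delta_j = (a^L_j - y(j)) * sigma'(z^L_j)
        delta = (aL - y) * sigmoid_prime(zL)
        dW = np.outer(delta, a_prev)   # dCi/dw^L_{jk} = delta_j * a^{L-1}_k
        db = delta                     # dCi/db^L_j   = delta_j
        return dW, db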


    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 34/60

  • Neural Networks

    Interpretation of the formula for ∂Ci/∂w^L_{jk}

    Observe that the rate of change of Ci w.r.t. w^L_{jk} depends on three factors (∂Ci/∂b^L_j only depends on the first two):

    • a^L_j − yi(j): how much the current output is off from the desired output

    • σ′(z^L_j): how fast the neuron reacts to changes of its input

    • a^{L−1}_k: the contribution from neuron k in layer L−1


    Thus, w^L_{jk} will learn slowly if the input neuron is in low activation (a^{L−1}_k ≈ 0), or the output neuron has "saturated", i.e., is in either high or low activation (in both cases σ′(z^L_j) ≈ 0).

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 35/60

  • Neural Networks

    What about layer L−1 (and further inside)?

    [Figure: neuron q in layer L−2 feeds neuron k in layer L−1 through the weight w^{L−1}_{kq} (bias b^{L−1}_k, output a^{L−1}_k); neuron k in turn feeds all output neurons j, whose activations a^L_1, . . . , a^L_j enter Ci.]

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 36/60

  • Neural Networks


    By the chain rule,

    ∂Ci/∂w^{L−1}_{kq} = ∑_j (∂Ci/∂a^L_j) (∂a^L_j/∂w^{L−1}_{kq}) = ∑_j (∂Ci/∂a^L_j) (∂a^L_j/∂a^{L−1}_k) (∂a^{L−1}_k/∂w^{L−1}_{kq})

    where

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 37/60

  • Neural Networks


    • ∂Ci/∂a^L_j: already computed (in the output layer);

    • ∂a^L_j/∂a^{L−1}_k = σ′(z^L_j) w^L_{jk}: the link between layers L and L−1;

    • ∂a^{L−1}_k/∂w^{L−1}_{kq}: similarly computed as in the output layer

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 38/60

  • Neural Networks

    [Figure: neuron q in layer ℓ (incoming weight w^ℓ_{qr}, bias b^ℓ_q, output a^ℓ_q) feeds neuron p in layer ℓ+1 (output a^{ℓ+1}_p), and so on through neuron k in layer L−1 up to the output neurons j and the cost Ci.]

    As we move further inside the network (from the output layer), we will need to compute more and more links between layers:

    ∂Ci/∂w^ℓ_{qr} = ∑_{p,...,k,j} (∂a^ℓ_q/∂w^ℓ_{qr}) (∂a^{ℓ+1}_p/∂a^ℓ_q) · · · (∂a^L_j/∂a^{L−1}_k) (∂Ci/∂a^L_j)

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 39/60

  • Neural Networks

    The backpropagation algorithm

    The products of the link terms may be computed iteratively from right to left, leading to an efficient algorithm for computing all ∂Ci/∂w^ℓ_{jk}, ∂Ci/∂b^ℓ_j (based on only xi):

    • Feedforward xi to obtain all neuron inputs and outputs:

    a^0 = xi;   a^ℓ = σ(W^ℓ a^{ℓ−1} + b^ℓ), for ℓ = 1, . . . , L

    • Backpropagate the network to compute

    ∂a^L_j/∂a^ℓ_q = ∑_{p,...,k} (∂a^{ℓ+1}_p/∂a^ℓ_q) · · · (∂a^L_j/∂a^{L−1}_k), for ℓ = L, . . . , 1

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 40/60

  • Neural Networks

    The backpropagation algorithm (cont'd)

    • Compute ∂Ci/∂w^ℓ_{qr}, ∂Ci/∂b^ℓ_q for every layer ℓ and every neuron q or pair of neurons (q, r) by using

    ∂Ci/∂w^ℓ_{qr} = ∑_j (∂a^ℓ_q/∂w^ℓ_{qr}) · (∂a^L_j/∂a^ℓ_q) · (∂Ci/∂a^L_j)

    ∂Ci/∂b^ℓ_q = ∑_j (∂a^ℓ_q/∂b^ℓ_q) · (∂a^L_j/∂a^ℓ_q) · (∂Ci/∂a^L_j)

    Note that ∂Ci/∂a^L_j only needs to be computed once.

    Remark. The entire backpropagation process can be vectorized, thus can be implemented efficiently.
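
    To make this concrete, here is a minimal vectorized sketch of backpropagation for the square loss with sigmoid activations, in the spirit of the algorithm above (it uses the standard per-layer vector of "deltas" rather than the per-neuron link products; all names are chosen here):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def backprop(x, y, Ws, bs):
        # forward pass: store every z^l and a^l
        a, activations, zs = x, [x], []
        for W, b in zip(Ws, bs):
            z = W @ a + b
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)
        # backward pass: delta^L = (a^L - y) * sigma'(z^L), then propagate back
        grads_W = [None] * len(Ws)
        grads_b = [None] * len(bs)
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        grads_W[-1] = np.outer(delta, activations[-2])
        grads_b[-1] = delta
        for l in range(2, len(Ws) + 1):
            delta = (Ws[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            grads_W[-l] = np.outer(delta, activations[-l - 1])
            grads_b[-l] = delta
        return grads_W, grads_b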

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 41/60

  • Neural Networks

    Stochastic gradient descent

    • Initialize all the weights w^ℓ_{jk} and biases b^ℓ_j;

    • For each training example xi,

    – Use backpropagation to compute the partial derivatives ∂Ci/∂w^ℓ_{jk}, ∂Ci/∂b^ℓ_j

    – Update the weights and biases by:

    w^ℓ_{jk} ←− w^ℓ_{jk} − η · ∂Ci/∂w^ℓ_{jk},   b^ℓ_j ←− b^ℓ_j − η · ∂Ci/∂b^ℓ_j

    This completes one epoch in the training process.

    • Repeat the preceding step until convergence.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 42/60

  • Neural Networks

    Remark. The previous procedure uses the single-sample update rule (one training point at a time). We can also use mini-batches {xi}_{i∈B} to perform gradient descent (for faster speed):

    • For every i ∈ B, use backpropagation to compute the partial derivatives ∂Ci/∂w^ℓ_{jk}, ∂Ci/∂b^ℓ_j

    • Update the weights and biases by:

    w^ℓ_{jk} ←− w^ℓ_{jk} − η · (1/|B|) ∑_{i∈B} ∂Ci/∂w^ℓ_{jk},

    b^ℓ_j ←− b^ℓ_j − η · (1/|B|) ∑_{i∈B} ∂Ci/∂b^ℓ_j
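
    A minimal sketch of one mini-batch update, assuming a routine backprop(x, y, Ws, bs) like the one sketched earlier (names chosen here):

    import numpy as np

    def update_mini_batch(batch, Ws, bs, eta, backprop):
        # average the per-example gradients over the mini-batch, then take one step
        sum_W = [np.zeros_like(W) for W in Ws]
        sum_b = [np.zeros_like(b) for b in bs]
        for x, y in batch:
            gW, gb = backprop(x, y, Ws, bs)
            sum_W = [s + g for s, g in zip(sum_W, gW)]
            sum_b = [s + g for s, g in zip(sum_b, gb)]
        m = len(batch)
        Ws = [W - (eta / m) * s for W, s in zip(Ws, sum_W)]
        bs = [b - (eta / m) * s for b, s in zip(bs, sum_b)]
        return Ws, bs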

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 43/60

  • Neural Networks

    Software for neural networks

    MATLAB: Neural networks are not part of the MATLAB Statistics and Machine Learning Toolbox; there is a separate Deep Learning Toolbox.

    Python: Nielsen has written from scratch excellent Python code exactly for MNIST digit classification, which is available at https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip.

    Otherwise, you can directly use the Python MLP classifier available at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 44/60

  • Neural Networks

    MATLAB demonstration

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 45/60

  • Neural Networks

    Nielsen's Python code for neural networks

    # load MNIST data into Python
    import mnist_loader
    training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

    # define a 3-layer neural network with the number of neurons on each layer
    import network
    net = network.Network([784, 30, 10])

    # execute stochastic gradient descent over 30 epochs, with mini-batches of size 10 and a learning rate of 3.0
    net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 46/60

  • Neural Networks

    Practical issues and techniques for improvement

    We have covered the main ideas of neural networks. There are a lot of practical issues to consider:

    • How to fix learning slowdown

    • How to avoid overfitting

    • How to initialize the weights and biases for gradient descent

    • How to choose the hyperparameters, such as the learning rate, regularization parameter, configuration of the network, etc.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 47/60

  • Neural Networks

    The learning slowdown issue with square loss

    Consider for simplicity a single sigmoid neuron:

    [Figure: a single sigmoid neuron with inputs x1, . . . , xd, weights w1, . . . , wd, bias b, and output σ(w · x + b).]

    The total input and output are z = w · x + b and a = σ(z), respectively.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 48/60

  • Neural Networks

    Under the square loss C(w, b) = (1/2)(a − y)² we obtain that

    ∂C/∂wj = (a − y) ∂a/∂wj = (a − y) σ′(z) xj

    ∂C/∂b = (a − y) ∂a/∂b = (a − y) σ′(z)

    When z is initially large in magnitude, σ′(z) ≈ 0. This shows that both wj, b will initially learn very slowly (which could be good or bad):

    wj ←− wj − η · (a − y) σ′(z) xj,

    b ←− b − η · (a − y) σ′(z).

    Therefore, the σ′(z) term may cause a learning slowdown when the initial weighted input z is large in the wrong direction.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 49/60

  • Neural Networks

    How to fix the learning slowdown issue

    First method: Use the logistic loss (also called the cross-entropy loss) instead:

    C(w, b) = −y log(a) − (1 − y) log(1 − a)

    With this loss, we can show that the σ′(z) term is gone:

    ∂C/∂wj = (a − y) xj

    ∂C/∂b = a − y

    so that gradient descent will move fast when a is far from y.
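
    A quick sketch of these gradients for a single sigmoid neuron (names chosen here):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy_grads(x, y, w, b):
        # for the cross-entropy loss: dC/dw_j = (a - y) x_j and dC/db = a - y
        a = sigmoid(np.dot(w, x) + b)
        return (a - y) * x, a - y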

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 50/60

  • Neural Networks

    Second method: Add a "softmax output layer" with the log-likelihood cost (see Nielsen's book, Chapter 3):

    • Define a new type of output layer by changing the activation function from sigmoid to softmax:

    a^L_j = σ(z^L_j)   −→   a^L_j = e^{z^L_j} / ∑_k e^{z^L_k}

    • Use the log-likelihood cost (instead of the logistic loss)

    C = ∑ Ci,   Ci = − log a^L_{yi}

    Remark. In general, both techniques (a sigmoid output layer and cross-entropy, or a softmax output layer and log-likelihood) work similarly well. One advantage of the softmax layer is the interpretation of its outputs a^L_j as probabilities.
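
    A sketch of the softmax output and the log-likelihood cost (a max shift is used for numerical stability; names chosen here):

    import numpy as np

    def softmax(zL):
        # a^L_j = exp(z^L_j) / sum_k exp(z^L_k)
        e = np.exp(zL - zL.max())
        return e / e.sum()

    def log_likelihood_cost(zL, label):
        # Ci = -log a^L_{y_i}, where label is the true class index
        return -np.log(softmax(zL)[label])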

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 51/60

  • Neural Networks

    Nielsen's implementation for neural networks with cross-entropy loss

    # define a 3-layer neural network with cross-entropy cost
    import network2
    net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)

    # stochastic gradient descent
    net.large_weight_initializer()
    net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, monitor_evaluation_accuracy=True)

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 52/60

  • Neural Networks

    How to avoid overfitting

    Neural networks, due to their many parameters, are likely to overfit, especially when given insufficient training data.

    Like regularized logistic regression, we can add a regularization term of the form

    (λ/(2n)) ∑_{j,k,ℓ} |w^ℓ_{jk}|^p

    to any cost function used, in order to avoid overfitting.

    Typical choices of p are p = 2 (L2 regularization) and p = 1 (L1 regularization).
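
    A sketch of the L2 penalty term (names chosen here); it is simply added to whatever data cost is being used:

    import numpy as np

    def l2_penalty(Ws, lmbda, n):
        # (lambda / 2n) * sum over all weights of w^2
        return (lmbda / (2 * n)) * sum(np.sum(W ** 2) for W in Ws)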

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 53/60

  • Neural Networks

    Nielsen's implementation for regularized neural networks

    # define a 3-layer neural network with cross-entropy cost
    import network2
    net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)

    # stochastic gradient descent
    net.large_weight_initializer()
    net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, lmbda=5.0, monitor_evaluation_accuracy=True, monitor_training_accuracy=True)

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 54/60

  • Neural Networks

    Remark. Two more techniques to deal with overfitting are (see Nielsen's book, Chapter 3)

    • dropout, and

    • artificial expansion of training data.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 55/60

  • Neural Networks

    How to initialize weights and biases

    The biases b^ℓ_j for all neurons are initialized as standard Gaussian random variables.

    Regarding weight initialization:

    • First idea: Initialize the w^ℓ_{jk} also as standard Gaussian random variables.

    • Better idea: For each neuron, initialize the input weights as Gaussian random variables with mean 0 and standard deviation 1/√n_in, where n_in is the number of input weights to this neuron.

    Why the second idea is better: the total input to the neuron, z^ℓ_j = ∑_k w^ℓ_{jk} a^{ℓ−1}_k + b^ℓ_j, has a small standard deviation around zero, so that the neuron starts in the middle, not from the two ends (see Nielsen's book, Chapter 3).
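
    A sketch of this initialization for a network with layer sizes [n_0, n_1, . . . , n_L] (names chosen here); for example, init_params([784, 30, 10]) matches the network used above:

    import numpy as np

    def init_params(sizes, rng=None):
        # weights ~ N(0, 1/n_in), biases ~ N(0, 1)
        rng = rng if rng is not None else np.random.default_rng()
        Ws = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
              for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        bs = [rng.normal(0.0, 1.0, size=n_out) for n_out in sizes[1:]]
        return Ws, bs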

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 56/60

  • Neural Networks

    Python code for neural networks with better initialization

    # define a 3-layer neural network with cross-entropy cost
    import network2
    net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
    # network2's default weight initializer already uses Gaussians with standard deviation 1/sqrt(n_in)

    # stochastic gradient descent
    net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data, lmbda=5.0, monitor_evaluation_accuracy=True, monitor_training_accuracy=True)

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 57/60

  • Neural Networks

    How to set the hyper-parameters

    Parameter tuning for neural networks is hard and often requires specialist knowledge.

    • Rules of thumb: Start with subsets of data and small networks, e.g.,

    – consider only two classes (digits 0 and 1)

    – train a (784, 10) network first, and then something like (784, 30, 10) later

    – monitor the validation accuracy more often, say, after every 1,000 training images.

    and play with the parameters in order to get quick feedback from experiments.

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 58/60

  • Neural Networks

    Once things get improved, vary each hyperparameter separately (while fixing the rest) until the result stops improving (though this may only give you a locally optimal combination).

    • Automated approaches:

    – Grid search

    – Bayesian optimization

    See the references given in Nielsen's book (Chapter 3).

    Finally, remember that “the space of hyper-parameters is so large that one neverreally finishes optimizing, one only abandons the network to posterity.”

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 59/60

  • Neural Networks

    Summary• Presented what neural networks are and how to train them

    – Backpropagation

    – Gradient descent

    – Practical considerations

    • Neural networks are new, flexible and powerful

    • Neural networks are also an art to master

    Dr. Guangliang Chen | Mathematics & Statistics, San José State University 60/60