-
San José State University
Math 251: Statistical and Machine Learning Classification
Neural Networks
Dr. Guangliang Chen
-
Outline of the presentation:
• Overview
– What is a neural network
– What is a neuron
• Perceptrons
• Networks of sigmoid neurons
• Summary
-
Acknowledgments
This presentation is based on the following references:
• Nielsen’s online book “Neural Networks and Deep Learning” (http://neuralnetworksanddeeplearning.com)
• Olga Veksler’s lecture on neural networks (http://www.csd.uwo.ca/courses/CS9840a/Lecture10_NeuralNets.pdf)
-
What is an artificial neural network?
(Figure: a feedforward network with an input layer $x_1, \ldots, x_d$, hidden layer(s), and an output layer $y_1, \ldots, y_k$.)

The leftmost layer inputs the features $x_i$.

The rightmost layer outputs the predictions $y_i$.

The solid circles represent neurons, which process inputs from the preceding layer and output results for the next layer.

The neural network is called a deep network if it has more than one hidden layer (otherwise, it is said to be shallow).
-
ANN for MNIST handwritten digits recognition
(Figure: a network for MNIST, with 784 input pixels $x_1, \ldots, x_d$, hidden layer(s) performing abstraction, and an output layer with 10 classes: the digits 0, 1, ..., 9.)
-
What is a biological neuron?
• Neurons (or nerve cells) are special cells that process and transmit information by electrical signaling (in the brain and also the spinal cord)
• The human brain has around $10^{11}$ neurons
• A neuron connects to other neurons to form a network
• Each neuron communicates with between 1,000 and 10,000 other neurons
-
Components of a biological neuron
dendrites:
• “input wires”, receive inputs from other neurons
• a neuron may have thousands of dendrites, usually short

cell body: computational unit

axon:
• “output wire”, sends signals to other neurons
• single long structure (up to 1 m)
• splits into possibly thousands of branches at the end
-
Artificial neurons are mathematical functions
(Figure: a neuron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$, bias $b$, and activation $f$, computing $f(w \cdot x + b)$.)

In the above,

• $w_i$: weights, $b$: bias, and $f$: activation function
-
Two simple activation functions
• Heaviside step function: $H(z) = \mathbf{1}_{z > 0}$
• Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$

(Plot: the Heaviside function $H(z)$ and the sigmoid $\sigma(z)$ for $z \in [-3, 3]$.)
The corresponding neurons are called perceptrons and sigmoid neurons.
We will mention several other activation functions at the end.
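As a quick illustration, here is a minimal NumPy sketch of these two activation functions (the helper names are illustrative, not from the slides):

import numpy as np

def heaviside(z):
    """H(z) = 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(heaviside(z))   # 0/1 step values
print(sigmoid(z))     # smooth values between 0 and 1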
-
ANN is a composition of functions!
• Each neuron is a function

• It accepts inputs from the previous layer and produces outputs for the next layer

It can be proved that every continuous function can be approximated arbitrarily well with one hidden layer (containing enough hidden units) and a proper nonlinear activation function.

This is more of theoretical than practical interest.
(Figure: input layer, hidden layer(s), output layer.)
-
How to train ANNs in principle
1. Select an activation function for all neurons.

2. Tune the weights and biases at all neurons to match predictions and truth “as closely as possible”:

• formulate an objective or loss function $L$

• optimize it with gradient descent

– the technique is called backpropagation

– lots of notation due to the complex form of the gradient

– lots of tricks to get gradient descent to work reasonably well
-
Perceptrons
A perceptron is a neuron whose activation function is the Heaviside step function. It defines a linear, binary classifier (not necessarily optimal).
(Figure: a perceptron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$, and bias $b$, outputting $\mathrm{sgn}(w \cdot x + b)$; a two-class data set separated by the hyperplane $w \cdot x + b = 0$.)
-
Derivation of the Perceptron loss function
• If a point $x_i$ is misclassified, then $y_i(w \cdot x_i + b) < 0$ (implying that $-y_i(w \cdot x_i + b) > 0$, which can be regarded as a penalty/loss).

(Figure: misclassified points ⊗ lying on the wrong side of the hyperplane $w \cdot x + b = 0$.)

• Denote the set of misclassified points by $M$.

• The goal is to minimize the total loss
$$\ell(w, b) = -\sum_{i \in M} y_i(w \cdot x_i + b)$$

• If $\ell$ gets to zero, we know we have the best possible solution ($M$ empty → no training error)
-
How to minimize the perceptron loss
The perceptron loss contains a discrete object (i.e., $M$) that depends on the variables $w, b$, making it hard to minimize analytically.

To obtain an approximate solution, use gradient descent:

• Initialize the weights $w$ and bias $b$ (which would determine $M$).

• Iterate until a stopping criterion is met (see the next slide):
-
– Given $M$: the gradient may be computed as follows:
$$\frac{\partial \ell}{\partial w} = -\sum_{i \in M} y_i x_i, \qquad \frac{\partial \ell}{\partial b} = -\sum_{i \in M} y_i$$
We then update $w, b$ as follows:
$$w \longleftarrow w + \rho \sum_{i \in M} y_i x_i, \qquad b \longleftarrow b + \rho \sum_{i \in M} y_i$$
where $\rho > 0$ is a parameter, called the learning rate (see the sketch below).
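A minimal NumPy sketch of this batch update rule (the function and variable names are illustrative, not from the slides):

import numpy as np

def perceptron_train(X, y, rho=0.1, n_iter=100, rng=None):
    """Batch perceptron: X is (n, d), y holds labels +1/-1."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    w, b = rng.standard_normal(d), 0.0         # random initial hyperplane
    for _ in range(n_iter):
        M = y * (X @ w + b) < 0                # misclassified points
        if not M.any():                        # M empty -> zero loss, stop
            break
        w = w + rho * (y[M][:, None] * X[M]).sum(axis=0)
        b = b + rho * y[M].sum()
    return w, b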
-
– Given $w, b$: update $M$ as the set of new errors:
$$M = \{1 \le i \le n \mid y_i(w \cdot x_i + b) < 0\}$$

(Figure: a gradient descent step moves the hyperplane so that fewer points are misclassified.)
-
Stochastic gradient descent
The gradient descent approach we have used so far assumes that we have access to the full training set, and uses all the training data to iteratively update the weights and bias.

This may be slow for large data sets, or impractical in the setting of online learning (where data arrive sequentially).
A variant of gradient descent, called stochastic gradient descent, uses
• only a single training point, or
• a small subset of examples (called a mini-batch),

in each round to update the weights and bias.
-
• Single-sample update rule:

– Start with a random hyperplane (with corresponding $w$ and $b$)

– Randomly select a new point $x_i$ from the training set: if it lies on the correct side, make no change; otherwise update
$$w \longleftarrow w + \rho y_i x_i, \qquad b \longleftarrow b + \rho y_i$$

– Repeat until all examples have been visited (this is called an epoch)
-
• Mini-batch update rule (see the sketch after this list):

– Divide the training data into mini-batches (of size 5 or 10, say), and update the weights after processing each mini-batch:
$$w \longleftarrow w + \rho \sum y_i x_i, \qquad b \longleftarrow b + \rho \sum y_i$$
where the sums run over the misclassified points in the mini-batch

– Middle ground between a single sample and the full training set

– One iteration over all mini-batches is called an epoch
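A minimal NumPy sketch of one epoch of this mini-batch rule for the perceptron (function and variable names are illustrative, not from the slides):

import numpy as np

def perceptron_epoch_minibatch(X, y, w, b, rho=0.1, batch_size=10, rng=None):
    """One epoch of mini-batch updates; X is (n, d), y holds labels +1/-1."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))                 # shuffle, then split into mini-batches
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        M = yb * (Xb @ w + b) < 0                 # misclassified points in this mini-batch
        w = w + rho * (yb[M][:, None] * Xb[M]).sum(axis=0)
        b = b + rho * yb[M].sum()
    return w, b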
-
Comments on stochastic gradient descent
• Single-sample update rule applies to online learning
• Faster than full gradient descent, but may be less stable

• The mini-batch update rule might achieve some balance between speed and stability

• May find only a local minimum (the hyperplane gets trapped in a suboptimal location)
-
Some remarks about the perceptron algorithm
• If the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps, but not necessarily an optimal one.

• The number of steps can be very large. The smaller the margin (between the classes), the longer it takes to find a separating hyperplane.

• When the data are not separable, the algorithm will not converge, and cycles develop (which can be long and therefore hard to detect).

• It is thus not a good classifier, but it is conceptually very important (neuron, loss function, gradient descent).
-
Multilayer perceptrons (MLP)
(Figure: input layer, hidden layer(s), output layer.)

An MLP is a network of perceptrons.

However, each perceptron has a discrete behavior, making its effect on later layers hard to predict.

Next we will look at networks of sigmoid neurons.
-
Sigmoid neurons
Sigmoid neurons are smoothed-out (or soft) versions of the perceptrons:

A small change in any weight or bias causes only a small change in the output.

We say the neuron is in low (high) activation if the output is near 0 (1).

When the neuron is in high activation, we say that it fires.

$$\sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$

(Figure: a neuron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$, and bias $b$, outputting $\sigma(w \cdot x + b)$.)
-
Networks of sigmoid neurons
The output of such a network depends continuously on its weights and biases (so everything is more predictable compared with the MLP).

(Figure: input layer, hidden layer(s), output layer.)
-
How do we train a neural network?
• Notation, notation, notation
• Backpropagation
• Practical issues and solutions
-
Notation
For each $\ell = 1, \ldots, L$:

- $w^\ell_{jk}$: layer $\ell$, “j back to k” weight;

- $b^\ell_j$: layer $\ell$, neuron $j$ bias

- $a^\ell_j$: layer $\ell$, neuron $j$ output

- $z^\ell_j = \sum_k w^\ell_{jk} a^{\ell-1}_k + b^\ell_j$: weighted input to neuron $j$ in layer $\ell$

Note that $a^\ell_j = \sigma(z^\ell_j)$.

(Figure: neuron $k$ in layer $\ell-1$ feeds neuron $j$ in layer $\ell$ through the weight $w^\ell_{jk}$; neuron $j$ has bias $b^\ell_j$, weighted input $z^\ell_j$, and output $a^\ell_j = \sigma(z^\ell_j)$.)
-
Notation (vector form)
$\mathbf{W}^\ell = (w^\ell_{jk})_{j,k}$: matrix of all weights between layers $\ell - 1$ and $\ell$;

$\mathbf{b}^\ell = (b^\ell_j)_j$: vector of biases in layer $\ell$

$\mathbf{z}^\ell = (z^\ell_j)_j$: vector of weighted inputs to neurons in layer $\ell$

$\mathbf{a}^\ell = (a^\ell_j)_j$: vector of outputs from neurons in layer $\ell$

We write $\mathbf{a}^\ell = \sigma(\mathbf{z}^\ell)$ (componentwise).

(Figure: the layer $\ell-1$ output $\mathbf{a}^{\ell-1}$ is mapped by $\mathbf{W}^\ell, \mathbf{b}^\ell$ to $\mathbf{z}^\ell$ and then by $\sigma$ to $\mathbf{a}^\ell$, ending with the network output $\mathbf{a}^L$.)
-
The feedforward relationship
First note that

• The input layer is indexed by $\ell = 0$, so that $\mathbf{a}^0 = \mathbf{x}$.

• $\mathbf{a}^L$ is the network output.

For each $1 \le \ell \le L$,
$$\mathbf{z}^\ell = \mathbf{W}^\ell \mathbf{a}^{\ell-1} + \mathbf{b}^\ell, \qquad \mathbf{a}^\ell = \sigma(\mathbf{z}^\ell)$$
(see the sketch below).

(Figure: $\mathbf{a}^{\ell-1}$ is mapped by $\mathbf{W}^\ell, \mathbf{b}^\ell$ to $\mathbf{z}^\ell$ and then by $\sigma$ to $\mathbf{a}^\ell$, ending with the network output $\mathbf{a}^L$.)
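As an illustration, here is a minimal NumPy sketch of this feedforward pass (the container and helper names are assumptions, not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(x, weights, biases):
    """weights[l-1], biases[l-1] hold W^l and b^l; returns the network output a^L."""
    a = x                              # a^0 = x
    for W, b in zip(weights, biases):
        z = W @ a + b                  # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)                 # a^l = sigma(z^l)
    return a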
-
The network loss
To tune the weights and biases of a network of sigmoid neurons, we need to select a loss function.

We first consider the square loss due to its simplicity:
$$C(\{\mathbf{W}^\ell, \mathbf{b}^\ell\}_{1 \le \ell \le L}) = \frac{1}{2n} \sum_{i=1}^n \|\mathbf{a}^L(\mathbf{x}_i) - \mathbf{y}_i\|^2$$
where

• $\mathbf{a}^L(\mathbf{x}_i)$ is the network output when inputting a training example $\mathbf{x}_i$.

• $\mathbf{y}_i$ is the training label (coded by a vector).
-
Remark. In our setting, the labels are coded as follows:
$$\text{digit } 0 = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad \text{digit } 1 = \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad \text{digit } 9 = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix}$$

Therefore, by varying the weights and biases, we try to minimize the difference between each network output $\mathbf{a}^L(\mathbf{x}_i)$ and one of the vectors above (the one associated with the training class that $\mathbf{x}_i$ belongs to).
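For concreteness, a minimal NumPy sketch of this coding of the labels (illustrative only):

import numpy as np

def one_hot(digit, num_classes=10):
    """Code a digit label as a vector with a single 1 in the digit's position."""
    y = np.zeros(num_classes)
    y[digit] = 1.0
    return y

print(one_hot(3))   # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]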
-
Gradient descent
The network loss has too many variables to be minimized analytically:
$$C(\{\mathbf{W}^\ell, \mathbf{b}^\ell\}_{1 \le \ell \le L}) = \frac{1}{2n} \sum_{i=1}^n \|\mathbf{a}^L(\mathbf{x}_i) - \mathbf{y}_i\|^2$$

We’ll use gradient descent to attack the problem. However, computing all the partial derivatives $\frac{\partial C}{\partial w^\ell_{jk}}$, $\frac{\partial C}{\partial b^\ell_j}$ is highly nontrivial.

To simplify the task a bit, we consider a sample of size 1 consisting of only $\mathbf{x}_i$:
$$C_i(\{\mathbf{W}^\ell, \mathbf{b}^\ell\}_{1 \le \ell \le L}) = \frac{1}{2}\|\mathbf{a}^L(\mathbf{x}_i) - \mathbf{y}_i\|^2 = \frac{1}{2}\sum_j \left(a^L_j - y_i(j)\right)^2$$
which is enough as $\frac{\partial C}{\partial w^\ell_{jk}} = \frac{1}{n}\sum_i \frac{\partial C_i}{\partial w^\ell_{jk}}$ and $\frac{\partial C}{\partial b^\ell_j} = \frac{1}{n}\sum_i \frac{\partial C_i}{\partial b^\ell_j}$.
-
The output layer first
We start by computing $\frac{\partial C_i}{\partial w^L_{jk}}$, $\frac{\partial C_i}{\partial b^L_j}$ as they are the easiest.

(Figure: neuron $k$ in layer $L-1$ feeds neuron $j$ in the output layer $L$ through the weight $w^L_{jk}$; the outputs $a^L_1, \ldots, a^L_j, \ldots$ feed into the cost $C_i$.)
-
Computing $\frac{\partial C_i}{\partial w^L_{jk}}$, $\frac{\partial C_i}{\partial b^L_j}$ for the output layer

By the chain rule we have
$$\frac{\partial C_i}{\partial w^L_{jk}} = \frac{\partial C_i}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial w^L_{jk}}$$
where $\frac{\partial C_i}{\partial a^L_j} = a^L_j - y_i(j)$ for the square loss, and
$$\frac{\partial a^L_j}{\partial w^L_{jk}} = \frac{\partial a^L_j}{\partial z^L_j} \cdot \frac{\partial z^L_j}{\partial w^L_{jk}} = \sigma'(z^L_j)\, a^{L-1}_k$$
which is obtained by applying the chain rule again with the formula for $a^L_j$:
$$a^L_j = \sigma(z^L_j), \qquad z^L_j = \sum_{k'} w^L_{jk'} a^{L-1}_{k'} + b^L_j$$

(Figure: neuron $k$ in layer $L-1$ connects to neuron $j$ in the output layer through $w^L_{jk}$; $a^L_j$ feeds into $C_i$.)
-
Computing $\frac{\partial C_i}{\partial w^L_{jk}}$, $\frac{\partial C_i}{\partial b^L_j}$ for the output layer (cont’d)

Combining results gives that
$$\frac{\partial C_i}{\partial w^L_{jk}} = \frac{\partial C_i}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial w^L_{jk}} = \left(a^L_j - y_i(j)\right) \sigma'(z^L_j)\, a^{L-1}_k.$$

Similarly, we obtain that
$$\frac{\partial C_i}{\partial b^L_j} = \frac{\partial C_i}{\partial a^L_j} \cdot \frac{\partial a^L_j}{\partial b^L_j} = \left(a^L_j - y_i(j)\right) \sigma'(z^L_j).$$

(Figure: the same output-layer diagram, with $a^L_j = \sigma(z^L_j)$ and $z^L_j = \sum_{k'} w^L_{jk'} a^{L-1}_{k'} + b^L_j$.)
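A minimal NumPy sketch of these output-layer gradients for a single training example (array names are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def output_layer_grads(a_prev, z_L, a_L, y):
    """a_prev = a^{L-1}, z_L = z^L, a_L = a^L, y = one-hot label vector."""
    delta = (a_L - y) * sigmoid_prime(z_L)    # (a^L_j - y(j)) * sigma'(z^L_j)
    grad_W = np.outer(delta, a_prev)          # dC_i/dw^L_{jk} = delta_j * a^{L-1}_k
    grad_b = delta                            # dC_i/db^L_j  = delta_j
    return grad_W, grad_b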
-
Interpretation of the formula for $\frac{\partial C_i}{\partial w^L_{jk}}$

Observe that the rate of change of $C_i$ w.r.t. $w^L_{jk}$ depends on three factors ($\frac{\partial C_i}{\partial b^L_j}$ only depends on the first two):

• $a^L_j - y_i(j)$: how much the current output is off from the desired output

• $\sigma'(z^L_j)$: how fast the neuron reacts to changes of its input

• $a^{L-1}_k$: contribution from neuron $k$ in layer $L-1$

(Figure: the same output-layer diagram.)

Thus, $w^L_{jk}$ will learn slowly if the input neuron is in low activation ($a^{L-1}_k \approx 0$), or the output neuron has “saturated”, i.e., is in either high or low activation (in both cases $\sigma'(z^L_j) \approx 0$).
-
What about layer $L-1$ (and further inside)?

(Figure: neuron $q$ in layer $L-2$ connects to neuron $k$ in layer $L-1$ through the weight $w^{L-1}_{kq}$ and bias $b^{L-1}_k$; the output $a^{L-1}_k$ feeds the output layer and hence $C_i$.)
-
(Figure: the same diagram as on the previous slide.)

By the chain rule,
$$\frac{\partial C_i}{\partial w^{L-1}_{kq}} = \sum_j \frac{\partial C_i}{\partial a^L_j}\, \frac{\partial a^L_j}{\partial w^{L-1}_{kq}} = \sum_j \frac{\partial C_i}{\partial a^L_j}\, \frac{\partial a^L_j}{\partial a^{L-1}_k}\, \frac{\partial a^{L-1}_k}{\partial w^{L-1}_{kq}}$$
where
-
(Figure: the same diagram as above.)

• $\frac{\partial C_i}{\partial a^L_j}$: already computed (in the output layer);

• $\frac{\partial a^L_j}{\partial a^{L-1}_k} = \sigma'(z^L_j)\, w^L_{jk}$: link between layers $L$ and $L-1$;

• $\frac{\partial a^{L-1}_k}{\partial w^{L-1}_{kq}}$: computed similarly as in the output layer
-
(Figure: neuron $q$ in layer $\ell$, with weight $w^\ell_{qr}$ and bias $b^\ell_q$, feeds neuron $p$ in layer $\ell+1$ and, through the later layers, the cost $C_i$.)

As we move further inside the network (from the output layer), we will need to compute more and more links between layers:
$$\frac{\partial C_i}{\partial w^\ell_{qr}} = \sum_{p, \ldots, k, j} \frac{\partial a^\ell_q}{\partial w^\ell_{qr}}\, \frac{\partial a^{\ell+1}_p}{\partial a^\ell_q} \cdots \frac{\partial a^L_j}{\partial a^{L-1}_k}\, \frac{\partial C_i}{\partial a^L_j}$$
-
The backpropagation algorithm
The products of the link terms may be computed iteratively from right to left, leading to an efficient algorithm for computing all $\frac{\partial C_i}{\partial w^\ell_{jk}}$, $\frac{\partial C_i}{\partial b^\ell_j}$ (based on only $\mathbf{x}_i$):

• Feedforward $\mathbf{x}_i$ to obtain all neuron inputs and outputs:
$$\mathbf{a}^0 = \mathbf{x}_i; \quad \mathbf{a}^\ell = \sigma(\mathbf{W}^\ell \mathbf{a}^{\ell-1} + \mathbf{b}^\ell), \quad \text{for } \ell = 1, \ldots, L$$

• Backpropagate through the network to compute
$$\frac{\partial a^L_j}{\partial a^\ell_q} = \sum_{p, \ldots, k} \frac{\partial a^{\ell+1}_p}{\partial a^\ell_q} \cdots \frac{\partial a^L_j}{\partial a^{L-1}_k}, \quad \text{for } \ell = L, \ldots, 1$$
-
The backpropagation algorithm (cont’d)
• Compute $\frac{\partial C_i}{\partial w^\ell_{qr}}$, $\frac{\partial C_i}{\partial b^\ell_q}$ for every layer $\ell$ and every neuron $q$ or pair of neurons $(q, r)$ by using
$$\frac{\partial C_i}{\partial w^\ell_{qr}} = \sum_j \frac{\partial a^\ell_q}{\partial w^\ell_{qr}} \cdot \frac{\partial a^L_j}{\partial a^\ell_q} \cdot \frac{\partial C_i}{\partial a^L_j}, \qquad \frac{\partial C_i}{\partial b^\ell_q} = \sum_j \frac{\partial a^\ell_q}{\partial b^\ell_q} \cdot \frac{\partial a^L_j}{\partial a^\ell_q} \cdot \frac{\partial C_i}{\partial a^L_j}$$

Note that $\frac{\partial C_i}{\partial a^L_j}$ only needs to be computed once.

Remark. The entire backpropagation process can be vectorized, and thus can be implemented efficiently (see the sketch below).
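A minimal vectorized sketch of backpropagation for the square loss, in the spirit of Nielsen's implementation (the names and data layout are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y, weights, biases):
    """Gradients of C_i = 0.5 * ||a^L - y||^2 w.r.t. all weights and biases."""
    # Feedforward, storing all z^l and a^l
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Backward pass: delta^L = (a^L - y) * sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [np.outer(delta, activations[-2])]
    grads_b = [delta]
    # Propagate backwards: delta^l = (W^{l+1})^T delta^{l+1} * sigma'(z^l)
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_W.insert(0, np.outer(delta, activations[-l - 1]))
        grads_b.insert(0, delta)
    return grads_W, grads_b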
-
Stochastic gradient descent
• Initialize all the weights $w^\ell_{jk}$ and biases $b^\ell_j$;

• For each training example $\mathbf{x}_i$,

– Use backpropagation to compute the partial derivatives $\frac{\partial C_i}{\partial w^\ell_{jk}}$, $\frac{\partial C_i}{\partial b^\ell_j}$

– Update the weights and biases by:
$$w^\ell_{jk} \longleftarrow w^\ell_{jk} - \eta \cdot \frac{\partial C_i}{\partial w^\ell_{jk}}, \qquad b^\ell_j \longleftarrow b^\ell_j - \eta \cdot \frac{\partial C_i}{\partial b^\ell_j}$$

This completes one epoch in the training process.

• Repeat the preceding step until convergence.
-
Remark. The previous procedure uses the single-sample update rule (one training example at a time). We can also use mini-batches $\{\mathbf{x}_i\}_{i \in B}$ to perform gradient descent (for faster speed):

• For every $i \in B$, use backpropagation to compute the partial derivatives $\frac{\partial C_i}{\partial w^\ell_{jk}}$, $\frac{\partial C_i}{\partial b^\ell_j}$

• Update the weights and biases by:
$$w^\ell_{jk} \longleftarrow w^\ell_{jk} - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \frac{\partial C_i}{\partial w^\ell_{jk}}, \qquad b^\ell_j \longleftarrow b^\ell_j - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \frac{\partial C_i}{\partial b^\ell_j}$$
-
Software for neural networks
MATLAB: Neural networks are not part of the MATLAB Statistics and Machine Learning Toolbox; there is a separate Deep Learning Toolbox.

Python: Nielsen has written, from scratch, excellent Python code exactly for MNIST digit classification, which is available at https://github.com/mnielsen/neural-networks-and-deep-learning/archive/master.zip.

Otherwise, you can directly use the Python MLP classifier available at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html.
-
MATLAB demonstration
-
Nielsen’s Python code for neural networks

# load MNIST data into Python
import mnist_loader
training_data, validation_data, test_data = mnist_loader.load_data_wrapper()

# define a 3-layer neural network with the number of neurons in each layer
import network
net = network.Network([784, 30, 10])

# run stochastic gradient descent over 30 epochs, with mini-batches of size 10 and a learning rate of 3
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
-
Practical issues and techniques for improvement
We have covered the main ideas of neural networks. There are a lot of practical issues to consider:

• How to fix learning slowdown

• How to avoid overfitting

• How to initialize the weights and biases for gradient descent

• How to choose the hyperparameters, such as the learning rate, regularization parameter, configuration of the network, etc.
-
The learning slowdown issue with the square loss
Consider for simplicity a single sigmoid neuron.

(Figure: a single sigmoid neuron with inputs $x_1, \ldots, x_d$, weights $w_1, \ldots, w_d$, and bias $b$, outputting $\sigma(w \cdot x + b)$.)

The total input and output are $z = w \cdot x + b$ and $a = \sigma(z)$, respectively.
-
Under the square loss $C(w, b) = \frac{1}{2}(a - y)^2$ we obtain that
$$\frac{\partial C}{\partial w_j} = (a - y)\frac{\partial a}{\partial w_j} = (a - y)\sigma'(z)x_j, \qquad \frac{\partial C}{\partial b} = (a - y)\frac{\partial a}{\partial b} = (a - y)\sigma'(z)$$

When $z$ is initially large in magnitude, $\sigma'(z) \approx 0$. This shows that both $w_j$ and $b$ will initially learn very slowly (which could be good or bad):
$$w_j \longleftarrow w_j - \eta \cdot (a - y)\sigma'(z)x_j, \qquad b \longleftarrow b - \eta \cdot (a - y)\sigma'(z).$$

Therefore, the $\sigma'(z)$ term may cause a learning slowdown when the initial weighted input $z$ is large in the wrong direction.
-
How to fix the learning slowdown issue
First method: use the logistic loss (also called the cross-entropy loss) instead:
$$C(w, b) = -y \log(a) - (1 - y)\log(1 - a)$$

With this loss, we can show that the $\sigma'(z)$ term is gone:
$$\frac{\partial C}{\partial w_j} = (a - y)x_j, \qquad \frac{\partial C}{\partial b} = a - y$$
so that gradient descent will move fast when $a$ is far from $y$.
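For completeness, a short derivation of this cancellation (an added note, using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1 - a)$):
$$\frac{\partial C}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}, \qquad \frac{\partial C}{\partial w_j} = \frac{\partial C}{\partial a}\,\sigma'(z)\,x_j = \frac{a - y}{a(1 - a)}\, a(1 - a)\, x_j = (a - y)\,x_j.$$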
-
Second method: add a “softmax output layer” with the log-likelihood cost (see Nielsen’s book, Chapter 3):

• Define a new type of output layer by changing the activation function from sigmoid to softmax:
$$a^L_j = \sigma(z^L_j) \;\longrightarrow\; a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$$

• Use the log-likelihood cost (instead of the logistic loss)
$$C = \sum_i C_i, \qquad C_i = -\log a^L_{y_i}$$

Remark. In general, both techniques (a sigmoid output layer with cross-entropy, or a softmax output layer with log-likelihood) work similarly well. One advantage of the softmax layer is the interpretation of its outputs $a^L_j$ as probabilities.
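A minimal NumPy sketch of the softmax output and log-likelihood cost (numerically stabilized; the helper names are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtracting the max does not change the result
    return e / e.sum()

def log_likelihood_cost(z_L, label):
    """C_i = -log a^L_{y_i}, where a^L = softmax(z^L) and label is the true class index."""
    a_L = softmax(z_L)
    return -np.log(a_L[label])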
-
Nielsen’s implementation for neural networks with cross-entropy loss

# define a 3-layer neural network with cross-entropy cost
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)

# stochastic gradient descent
net.large_weight_initializer()
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
    monitor_evaluation_accuracy=True)
-
How to avoid overfitting
Neural networks, due to their many parameters, are likely to overfit, especially when given insufficient training data.

Like regularized logistic regression, we can add a regularization term of the form
$$\frac{\lambda}{2n} \sum_{j, k, \ell} |w^\ell_{jk}|^p$$
to any cost function used, in order to avoid overfitting.

Typical choices of $p$ are $p = 2$ (L2 regularization) and $p = 1$ (L1 regularization).
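For example, with L2 regularization ($p = 2$) added to a cost $C_0$, the weight update in gradient descent picks up a “weight decay” factor (a standard computation; see Nielsen’s book, Chapter 3):
$$w^\ell_{jk} \longleftarrow w^\ell_{jk} - \eta\left(\frac{\partial C_0}{\partial w^\ell_{jk}} + \frac{\lambda}{n}\, w^\ell_{jk}\right) = \left(1 - \frac{\eta\lambda}{n}\right) w^\ell_{jk} - \eta\,\frac{\partial C_0}{\partial w^\ell_{jk}},$$
while the bias update is unchanged.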
-
Nielsen’s implementation for regularized neural networks

# define a 3-layer neural network with cross-entropy cost
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)

# stochastic gradient descent
net.large_weight_initializer()
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
    lmbda=5.0, monitor_evaluation_accuracy=True, monitor_training_accuracy=True)
-
Remark. Two more techniques to deal with overfitting are (see Nielsen’s book, Chapter 3)
• dropout, and
• artificial expansion of training data.
-
How to initialize weights and biases
The biases $b^\ell_j$ for all neurons are initialized as standard Gaussian random variables.

Regarding weight initialization:

• First idea: initialize the $w^\ell_{jk}$ also as standard Gaussian random variables.

• Better idea: for each neuron, initialize the input weights as Gaussian random variables with mean 0 and standard deviation $1/\sqrt{n_{\text{in}}}$, where $n_{\text{in}}$ is the number of input weights to this neuron.

Why the second idea is better: the total input to the neuron, $z^\ell_j = \sum_k w^\ell_{jk} a^{\ell-1}_k + b^\ell_j$, has a small standard deviation around zero, so that the neuron starts in the middle of the sigmoid, not at the two ends (see Nielsen’s book, Chapter 3).
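A minimal NumPy sketch of this initialization scheme for one layer (array shapes follow the $\mathbf{W}^\ell$ convention above; the names are illustrative):

import numpy as np

def init_layer(n_in, n_out, rng=None):
    """Biases: standard Gaussians; each neuron's input weights: mean 0, std 1/sqrt(n_in)."""
    rng = np.random.default_rng() if rng is None else rng
    b = rng.standard_normal(n_out)
    W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    return W, b

W1, b1 = init_layer(784, 30)   # hidden layer of a (784, 30, 10) network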
-
Python code for neural networks with better initialization

# define a 3-layer neural network with cross-entropy cost;
# network2's default weight initializer uses Gaussians with standard deviation 1/sqrt(n_in)
import network2
net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)

# stochastic gradient descent
net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
    lmbda=5.0, monitor_evaluation_accuracy=True, monitor_training_accuracy=True)
-
How to set the hyper-parameters
Parameter tuning for neural networks is hard and often requires specialist knowledge.

• Rules of thumb: start with subsets of the data and small networks, e.g.,

– consider only two classes (digits 0 and 1)

– train a (784, 10) network first, and then something like (784, 30, 10) later

– monitor the validation accuracy more often, say, after every 1,000 training images

and play with the parameters in order to get quick feedback from experiments.
-
Once things improve, vary each hyperparameter separately (while fixing the rest) until the result stops improving (though this may only give you a locally optimal combination).
• Automated approaches:
– Grid search
– Bayesian optimization
See the references given in Nielsen’s book (Chapter 3).

Finally, remember that “the space of hyper-parameters is so large that one never really finishes optimizing, one only abandons the network to posterity.”
-
Summary
• Presented what neural networks are and how to train them
– Backpropagation
– Gradient descent
– Practical considerations
• Neural networks are new, flexible and powerful
• Neural networks are also an art to master