Neural Networks
Machine Learning
Torsten Möller
Neural Networks
• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than plausibility
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Applications
Feed-forward Networks

• We have looked at generalized linear models of the form:

$$y(x, w) = f\Big(\sum_{j=1}^{M} w_j \phi_j(x)\Big)$$

for fixed non-linear basis functions $\phi(\cdot)$
• We now extend this model by allowing adaptive basis functions, and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

$$\phi_j(x) = f\Big(\sum \ldots\Big)$$

Feed-forward Networks

• Starting with input $x = (x_1, \ldots, x_D)$, construct linear combinations:

$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}$$

These $a_j$ are known as activations
• Pass each through an activation function $h(\cdot)$ to get output $z_j = h(a_j)$
• Model of an individual neuron

[Figure: model of an individual neuron, from Russell and Norvig, AIMA2e]

Activation Functions

• Can use a variety of activation functions
  • Sigmoidal (S-shaped)
    • Logistic sigmoid $1/(1 + \exp(-a))$ (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function $z_j = \sum_i (x_i - w_{ji})^2$
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • …
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit
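To make these concrete, here is a minimal NumPy sketch of a few of the activation functions above (the function names are mine, not from the slides):

```python
import numpy as np

def logistic(a):
    # Logistic sigmoid 1/(1 + exp(-a)): squashes to (0, 1), useful for binary classification.
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Softmax over a vector of activations, useful for multi-class classification.
    # Subtracting the max is a standard numerical-stability trick; it does not change the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def identity(a):
    # Identity, useful for regression outputs.
    return a

# The hyperbolic tangent is available directly as np.tanh.
```
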
Feed-forward Networks

[Figure: two-layer feed-forward network; inputs $x_0, x_1, \ldots, x_D$, hidden units $z_0, z_1, \ldots, z_M$, outputs $y_1, \ldots, y_K$, with weights such as $w^{(1)}_{MD}$, $w^{(2)}_{KM}$, and bias weight $w^{(2)}_{10}$]

• Connect together a number of these units into a feed-forward network (DAG)
• Above shows a network with one layer of hidden units
• Implements the function:

$$y_k(x, w) = h\Bigg(\sum_{j=1}^{M} w^{(2)}_{kj}\, h\Big(\sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}\Big) + w^{(2)}_{k0}\Bigg)$$
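A hedged NumPy sketch of this forward computation for one hidden layer; the tanh hidden units, identity outputs, and all variable names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # W1: (M, D) first-layer weights w^(1)_ji;  b1: (M,) biases w^(1)_j0
    # W2: (K, M) second-layer weights w^(2)_kj; b2: (K,) biases w^(2)_k0
    a = W1 @ x + b1        # activations a_j
    z = np.tanh(a)         # hidden unit outputs z_j = h(a_j)
    y = W2 @ z + b2        # outputs y_k (identity output activation)
    return y, z, a

# Example with D=3 inputs, M=4 hidden units, K=2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y, z, a = forward(rng.normal(size=3), W1, b1, W2, b2)
```
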
Network Training

• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, then optimize against it
• For regression, training data are $(x_n, t_n)$, $t_n \in \mathbb{R}$
  • Squared error naturally arises:

$$E(w) = \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2$$

• For binary classification, this is another discriminative model; maximum likelihood (ML) gives:

$$p(t|w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$

$$E(w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$$
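A small sketch of the two error functions, assuming `y` collects the network outputs $y_n$ over the $N$ training cases:

```python
import numpy as np

def squared_error(y, t):
    # Sum-of-squares error for regression targets t_n in R.
    return np.sum((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    # Negative log-likelihood for binary targets t_n in {0, 1} with
    # outputs y_n in (0, 1); clipping guards against log(0).
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```
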
Parameter Optimization

[Figure: error surface $E(w)$ over a two-dimensional weight space $(w_1, w_2)$, marking points $w_A$, $w_B$, $w_C$ and a local gradient $\nabla E$]

• For either of these problems, the error function $E(w)$ is nasty
• Nasty = non-convex
• Non-convex = has local minima
Descent Methods

• The typical strategy for optimization problems of this sort is a descent method:

$$w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}$$

• As we've seen before, these come in many flavours
  • Gradient descent $\nabla E(w^{(\tau)})$
  • Stochastic gradient descent $\nabla E_n(w^{(\tau)})$
  • Newton-Raphson (second order) $\nabla^2$
• All of these can be used here; stochastic gradient descent is particularly effective
  • Redundancy in training data, escaping local minima
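A sketch of stochastic gradient descent under these definitions; the callable `grad_En`, the learning rate `eta`, and the epoch loop are illustrative assumptions:

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.01, epochs=10, seed=0):
    # Update w <- w + Delta w with Delta w = -eta * grad E_n(w),
    # visiting one training case at a time in random order.
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):
            w = w - eta * grad_En(w, data[n])
    return w
```
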
Computing Gradients

• The function $y(x_n, w)$ implemented by a network is complicated
  • It isn't obvious how to compute error function derivatives with respect to weights
• Numerical method for calculating error derivatives: use finite differences:

$$\frac{\partial E_n}{\partial w_{ji}} \approx \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon}$$

• How much computation would this take with $W$ weights in the network?
  • $O(W)$ per derivative, $O(W^2)$ total per gradient descent step
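Finite differences are easy to implement, which makes them useful as a check on analytic gradients even though they are too slow for training; a sketch (the error callable `En` over a flat weight vector is an assumption):

```python
import numpy as np

def numerical_gradient(En, w, eps=1e-6):
    # Central differences: each of the W derivatives needs two
    # O(W) evaluations of En, hence O(W^2) work for a full gradient.
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (En(w_plus) - En(w_minus)) / (2.0 * eps)
    return g
```
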
Error Backpropagation

• Backprop is an efficient method for computing error derivatives $\partial E_n / \partial w_{ji}$
  • $O(W)$ to compute derivatives wrt all weights
• First, feed training example $x_n$ forward through the network, storing all activations $a_j$
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. For linear output nodes $y_k = \sum_i w_{ki} z_i$:

$$\frac{\partial E_n}{\partial w_{ki}} = \frac{\partial}{\partial w_{ki}}\, \frac{1}{2}(y_n - t_n)^2 = (y_n - t_n)\, z_{ni}$$

(the factor multiplying the error is $z_{ni}$, the input that the output node receives)
• For hidden layers, propagate error backwards from the output nodes
Chain Rule for Partial Derivatives

• A "reminder"
• For $f(x, y)$, with $f$ differentiable wrt $x$ and $y$, and $x$ and $y$ differentiable wrt $u$ and $v$:

$$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$$

and

$$\frac{\partial f}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v}$$
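As a quick sanity check of the first identity (a worked example, not from the slides): take $f(x, y) = xy$ with $x = u + v$ and $y = u - v$, so that $f = u^2 - v^2$ directly. The chain rule gives

$$\frac{\partial f}{\partial u} = y \cdot 1 + x \cdot 1 = (u - v) + (u + v) = 2u,$$

which matches $\partial(u^2 - v^2)/\partial u = 2u$.
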
Error Backpropagation

• We can write

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\, E_n(a_{j_1}, a_{j_2}, \ldots, a_{j_m})$$

where $\{j_i\}$ are the indices of the nodes in the same layer as node $j$
• Using the chain rule:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} + \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}}$$

where $\sum_k$ runs over all other nodes $k$ in the same layer as node $j$
• Since $a_k$ does not depend on $w_{ji}$, all terms in the summation go to 0:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}}$$
Error Backpropagation cont.

• Introduce error $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$, so that

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j \frac{\partial a_j}{\partial w_{ji}}$$

• The other factor is:

$$\frac{\partial a_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} z_k = z_i$$
Error Backpropagation cont.

• Error $\delta_j$ can also be computed using the chain rule:

$$\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \underbrace{\frac{\partial E_n}{\partial a_k}}_{\delta_k} \frac{\partial a_k}{\partial a_j}$$

where $\sum_k$ runs over all nodes $k$ in the layer after node $j$
• Eventually:

$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$$

• A weighted sum of the later errors "caused" by this node
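Putting the pieces together, a hedged NumPy sketch of one backprop pass for the one-hidden-layer network from earlier, assuming tanh hidden units, linear outputs, and sum-of-squares error (so $\delta_k = y_k - t_k$ at the output); all names are illustrative:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    # Forward pass, storing activations.
    a = W1 @ x + b1                  # hidden activations a_j
    z = np.tanh(a)                   # hidden outputs z_j = h(a_j)
    y = W2 @ z + b2                  # linear outputs y_k

    # Output errors: delta_k = dEn/da_k = y_k - t_k for squared error.
    delta_out = y - t
    # Hidden errors: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2.
    delta_hid = (1.0 - z ** 2) * (W2.T @ delta_out)

    # Gradients: dEn/dw_ji = delta_j * (input to that layer), dEn/db_j = delta_j.
    return (np.outer(delta_hid, x), delta_hid,
            np.outer(delta_out, z), delta_out)
```

The gradients can be checked against `numerical_gradient` from the finite-differences sketch above.
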
Applications of Neural Networks

• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)
Hand-written Digit Recognition

• MNIST - standard dataset for hand-written digit recognition
• 60000 training, 10000 test images
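If you want to experiment with the data, one common route (an assumption on my part, not something the slides prescribe) is scikit-learn's OpenML fetcher:

```python
from sklearn.datasets import fetch_openml

# 70000 28x28 grayscale digits flattened to 784 features;
# the standard split takes the first 60000 images for training.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```
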
LeNet-5

[Figure: LeNet-5 architecture. INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT: 10 (Gaussian connections)]

• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 "filter")
  • Breaking symmetry
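A minimal sketch of the shared-weights idea: a single 5x5 filter is reused at every image position, so one feature map costs 25 weights plus a bias rather than one weight per connection (purely illustrative, not LeCun's implementation):

```python
import numpy as np

def conv2d_valid(img, filt, bias=0.0):
    # Slide one shared filter over all "valid" positions:
    # a 32x32 input with a 5x5 filter yields a 28x28 feature map, as in C1.
    H, W = img.shape
    fh, fw = filt.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + fh, c:c + fw] * filt) + bias
    return out

fmap = conv2d_valid(np.random.rand(32, 32), np.random.rand(5, 5))  # shape (28, 28)
```
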
[Figure: the 82 errors made by LeNet-5 (0.82% test error rate), each shown as true label → predicted label]
Deep Learning

• Collection of important techniques to improve performance:
  • Multi-layer networks (as in LeNet-5), auto-encoders for unsupervised feature learning, sparsity, convolutional networks, parameter tying, probabilistic models for initializing parameters, hinge activation functions for steeper gradients
• ufldl.stanford.edu/eccv10-tutorial
• http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
• Scalability is key; stochastic gradient descent is memory-efficient, so lots of data can be used
Conclusion

• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
  • Different error functions
• Learning is more difficult, error function not convex
  • Use stochastic gradient descent, obtain (good?) local minimum
• Backpropagation for efficient gradient computation