Neural Networks
Machine Learning
Torsten Möller
Neural Networks
• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than plausibility
Outline
Feed-forward Networks
Network Training
Error Backpropagation
Applications
Feed-forward Networks

• We have looked at generalized linear models of the form:

$$y(x, w) = f\Big(\sum_{j=1}^{M} w_j \phi_j(x)\Big)$$

for fixed non-linear basis functions $\phi(\cdot)$
• We now extend this model by allowing adaptive basis functions, and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

$$\phi_j(x) = f\Big(\sum \ldots\Big)$$

Feed-forward Networks

• Starting with input $x = (x_1, \ldots, x_D)$, construct linear combinations:

$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}$$

These $a_j$ are known as activations
• Pass each through an activation function $h(\cdot)$ to get output $z_j = h(a_j)$
• Model of an individual neuron

[Figure: model of an individual neuron, from Russell and Norvig, AIMA2e]

Activation Functions

• Can use a variety of activation functions
  • Sigmoidal (S-shaped)
    • Logistic sigmoid $1/(1 + \exp(-a))$ (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function $z_j = \sum_i (x_i - w_{ji})^2$
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • …
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit
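To make these concrete, here is a minimal NumPy sketch of a few of the activation functions above (the function names are mine, not from the slides):

```python
import numpy as np

def logistic(a):
    # Logistic sigmoid 1/(1 + exp(-a)): squashes to (0, 1), useful for binary classification.
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Softmax over a vector of activations, useful for multi-class classification.
    # Subtracting the max is a standard numerical-stability trick; it does not change the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def identity(a):
    # Identity, useful for regression outputs.
    return a

# The hyperbolic tangent is available directly as np.tanh.
```
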
Feed-forward Networks

[Figure: two-layer feed-forward network; inputs $x_0, x_1, \ldots, x_D$, hidden units $z_0, z_1, \ldots, z_M$, outputs $y_1, \ldots, y_K$, with weights such as $w^{(1)}_{MD}$, $w^{(2)}_{KM}$, and bias weight $w^{(2)}_{10}$]

• Connect together a number of these units into a feed-forward network (DAG)
• Above shows a network with one layer of hidden units
• Implements the function:

$$y_k(x, w) = h\Bigg(\sum_{j=1}^{M} w^{(2)}_{kj}\, h\Big(\sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}\Big) + w^{(2)}_{k0}\Bigg)$$
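A hedged NumPy sketch of this forward computation for one hidden layer; the tanh hidden units, identity outputs, and all variable names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # W1: (M, D) first-layer weights w^(1)_ji;  b1: (M,) biases w^(1)_j0
    # W2: (K, M) second-layer weights w^(2)_kj; b2: (K,) biases w^(2)_k0
    a = W1 @ x + b1        # activations a_j
    z = np.tanh(a)         # hidden unit outputs z_j = h(a_j)
    y = W2 @ z + b2        # outputs y_k (identity output activation)
    return y, z, a

# Example with D=3 inputs, M=4 hidden units, K=2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y, z, a = forward(rng.normal(size=3), W1, b1, W2, b2)
```
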
Network Training

• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, then optimize against it
• For regression, training data are $(x_n, t_n)$, $t_n \in \mathbb{R}$
  • Squared error naturally arises:

$$E(w) = \sum_{n=1}^{N} \{y(x_n, w) - t_n\}^2$$

• For binary classification, this is another discriminative model; maximum likelihood (ML) gives:

$$p(t|w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$

$$E(w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$$
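A small sketch of the two error functions, assuming `y` collects the network outputs $y_n$ over the $N$ training cases:

```python
import numpy as np

def squared_error(y, t):
    # Sum-of-squares error for regression targets t_n in R.
    return np.sum((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    # Negative log-likelihood for binary targets t_n in {0, 1} with
    # outputs y_n in (0, 1); clipping guards against log(0).
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```
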
Parameter Optimization

[Figure: error surface $E(w)$ over a two-dimensional weight space $(w_1, w_2)$, marking points $w_A$, $w_B$, $w_C$ and a local gradient $\nabla E$]

• For either of these problems, the error function $E(w)$ is nasty
• Nasty = non-convex
• Non-convex = has local minima
Descent Methods

• The typical strategy for optimization problems of this sort is a descent method:

$$w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}$$

• As we've seen before, these come in many flavours
  • Gradient descent $\nabla E(w^{(\tau)})$
  • Stochastic gradient descent $\nabla E_n(w^{(\tau)})$
  • Newton-Raphson (second order) $\nabla^2$
• All of these can be used here; stochastic gradient descent is particularly effective
  • Redundancy in training data, escaping local minima
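A sketch of stochastic gradient descent under these definitions; the callable `grad_En`, the learning rate `eta`, and the epoch loop are illustrative assumptions:

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.01, epochs=10, seed=0):
    # Update w <- w + Delta w with Delta w = -eta * grad E_n(w),
    # visiting one training case at a time in random order.
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):
            w = w - eta * grad_En(w, data[n])
    return w
```
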
Computing Gradients

• The function $y(x_n, w)$ implemented by a network is complicated
  • It isn't obvious how to compute error function derivatives with respect to weights
• Numerical method for calculating error derivatives: use finite differences:

$$\frac{\partial E_n}{\partial w_{ji}} \approx \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon}$$

• How much computation would this take with $W$ weights in the network?
  • $O(W)$ per derivative, $O(W^2)$ total per gradient descent step
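Finite differences are easy to implement, which makes them useful as a check on analytic gradients even though they are too slow for training; a sketch (the error callable `En` over a flat weight vector is an assumption):

```python
import numpy as np

def numerical_gradient(En, w, eps=1e-6):
    # Central differences: each of the W derivatives needs two
    # O(W) evaluations of En, hence O(W^2) work for a full gradient.
    g = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (En(w_plus) - En(w_minus)) / (2.0 * eps)
    return g
```
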
Error Backpropagation

• Backprop is an efficient method for computing error derivatives $\partial E_n / \partial w_{ji}$
  • $O(W)$ to compute derivatives wrt all weights
• First, feed training example $x_n$ forward through the network, storing all activations $a_j$
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. For linear output nodes $y_k = \sum_i w_{ki} z_i$:

$$\frac{\partial E_n}{\partial w_{ki}} = \frac{\partial}{\partial w_{ki}}\, \frac{1}{2}(y_n - t_n)^2 = (y_n - t_n)\, z_{ni}$$

(the factor multiplying the error is $z_{ni}$, the input that the output node receives)
• For hidden layers, propagate error backwards from the output nodes
Chain Rule for Partial Derivatives

• A "reminder"
• For $f(x, y)$, with $f$ differentiable wrt $x$ and $y$, and $x$ and $y$ differentiable wrt $u$ and $v$:

$$\frac{\partial f}{\partial u} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial u} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$$

and

$$\frac{\partial f}{\partial v} = \frac{\partial f}{\partial x}\frac{\partial x}{\partial v} + \frac{\partial f}{\partial y}\frac{\partial y}{\partial v}$$
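As a quick sanity check of the first identity (a worked example, not from the slides): take $f(x, y) = xy$ with $x = u + v$ and $y = u - v$, so that $f = u^2 - v^2$ directly. The chain rule gives

$$\frac{\partial f}{\partial u} = y \cdot 1 + x \cdot 1 = (u - v) + (u + v) = 2u,$$

which matches $\partial(u^2 - v^2)/\partial u = 2u$.
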
Error Backpropagation

• We can write

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}}\, E_n(a_{j_1}, a_{j_2}, \ldots, a_{j_m})$$

where $\{j_i\}$ are the indices of the nodes in the same layer as node $j$
• Using the chain rule:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}} + \sum_k \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}}$$

where $\sum_k$ runs over all other nodes $k$ in the same layer as node $j$
• Since $a_k$ does not depend on $w_{ji}$, all terms in the summation go to 0:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j}\frac{\partial a_j}{\partial w_{ji}}$$
Error Backpropagation cont.

• Introduce error $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$, so that

$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j \frac{\partial a_j}{\partial w_{ji}}$$

• The other factor is:

$$\frac{\partial a_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} z_k = z_i$$
Error Backpropagation cont.

• Error $\delta_j$ can also be computed using the chain rule:

$$\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \underbrace{\frac{\partial E_n}{\partial a_k}}_{\delta_k} \frac{\partial a_k}{\partial a_j}$$

where $\sum_k$ runs over all nodes $k$ in the layer after node $j$
• Eventually:

$$\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$$

• A weighted sum of the later errors "caused" by this node
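Putting the pieces together, a hedged NumPy sketch of one backprop pass for the one-hidden-layer network from earlier, assuming tanh hidden units, linear outputs, and sum-of-squares error (so $\delta_k = y_k - t_k$ at the output); all names are illustrative:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    # Forward pass, storing activations.
    a = W1 @ x + b1                  # hidden activations a_j
    z = np.tanh(a)                   # hidden outputs z_j = h(a_j)
    y = W2 @ z + b2                  # linear outputs y_k

    # Output errors: delta_k = dEn/da_k = y_k - t_k for squared error.
    delta_out = y - t
    # Hidden errors: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2.
    delta_hid = (1.0 - z ** 2) * (W2.T @ delta_out)

    # Gradients: dEn/dw_ji = delta_j * (input to that layer), dEn/db_j = delta_j.
    return (np.outer(delta_hid, x), delta_hid,
            np.outer(delta_out, z), delta_out)
```

The gradients can be checked against `numerical_gradient` from the finite-differences sketch above.
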
Applications of Neural Networks

• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)
Hand-written Digit Recognition

• MNIST - standard dataset for hand-written digit recognition
• 60000 training, 10000 test images
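If you want to experiment with the data, one common route (an assumption on my part, not something the slides prescribe) is scikit-learn's OpenML fetcher:

```python
from sklearn.datasets import fetch_openml

# 70000 28x28 grayscale digits flattened to 784 features;
# the standard split takes the first 60000 images for training.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```
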
LeNet-5

[Figure: LeNet-5 architecture. INPUT 32x32 → C1: feature maps 6@28x28 (convolutions) → S2: feature maps 6@14x14 (subsampling) → C3: feature maps 16@10x10 (convolutions) → S4: feature maps 16@5x5 (subsampling) → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT: 10 (Gaussian connections)]

• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 "filter")
  • Breaking symmetry
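A minimal sketch of the shared-weights idea: a single 5x5 filter is reused at every image position, so one feature map costs 25 weights plus a bias rather than one weight per connection (purely illustrative, not LeCun's implementation):

```python
import numpy as np

def conv2d_valid(img, filt, bias=0.0):
    # Slide one shared filter over all "valid" positions:
    # a 32x32 input with a 5x5 filter yields a 28x28 feature map, as in C1.
    H, W = img.shape
    fh, fw = filt.shape
    out = np.empty((H - fh + 1, W - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(img[r:r + fh, c:c + fw] * filt) + bias
    return out

fmap = conv2d_valid(np.random.rand(32, 32), np.random.rand(5, 5))  # shape (28, 28)
```
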
[Figure: the 82 errors made by LeNet-5 (0.82% test error rate), each shown as true label → predicted label]
Deep Learning

• Collection of important techniques to improve performance:
  • Multi-layer networks (as in LeNet-5), auto-encoders for unsupervised feature learning, sparsity, convolutional networks, parameter tying, probabilistic models for initializing parameters, hinge activation functions for steeper gradients
• ufldl.stanford.edu/eccv10-tutorial
• http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
• Scalability is key; stochastic gradient descent is memory-efficient, so lots of data can be used
Conclusion

• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
  • Different error functions
• Learning is more difficult, error function not convex
  • Use stochastic gradient descent, obtain (good?) local minimum
• Backpropagation for efficient gradient computation