

Neural Networks - Machine Learning

Torsten Möller

©Möller/Mori


Neural Networks

• Neural networks arise from attempts to model human/animal brains
• Many models, many claims of biological plausibility
• We will focus on multi-layer perceptrons
  • Mathematical properties rather than plausibility



Outline

Feed-forward Networks

Network Training

Error Backpropagation

Applications




Feed-forward Networks

• We have looked at generalized linear models of the form:

  y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)

  for fixed non-linear basis functions \phi(\cdot)

• We now extend this model by allowing adaptive basis functions, and learning their parameters
• In feed-forward networks (a.k.a. multi-layer perceptrons) we let each basis function be another non-linear function of a linear combination of the inputs:

  \phi_j(x) = f\left( \sum_{j=1}^{M} \dots \right)
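As a concrete illustration (not from the slides), a minimal numpy sketch of the fixed-basis model being extended here, assuming M = 5 fixed Gaussian basis functions on a scalar input and a logistic-sigmoid f; the centres and width are illustrative choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gaussian_basis(x, centres, width=1.0):
    # fixed basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 width^2)); the mu_j are not learned
    return np.exp(-((x - centres) ** 2) / (2.0 * width ** 2))

def glm_predict(x, w, centres):
    # y(x, w) = f( sum_j w_j phi_j(x) ) with f = logistic sigmoid
    return sigmoid(np.dot(w, gaussian_basis(x, centres)))

centres = np.linspace(-2.0, 2.0, 5)   # 5 fixed centres: only w is adapted during training
w = np.zeros(5)
print(glm_predict(0.3, w, centres))   # 0.5 for all-zero weights
```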




Feed-forward Networks

• Starting with input x = (x_1, \dots, x_D), construct linear combinations:

  a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}

  These a_j are known as activations

• Pass through an activation function h(\cdot) to get output z_j = h(a_j)
• Model of an individual neuron (figure from Russell and Norvig, AIMA2e)
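A small numpy sketch (an illustration, not part of the original slides) of this computation for one layer of units, assuming a tanh activation h; the sizes and values are arbitrary:

```python
import numpy as np

def layer_forward(x, W, b, h=np.tanh):
    # a_j = sum_i W[j, i] * x_i + b[j]  (activations), z_j = h(a_j)  (unit outputs)
    a = W @ x + b
    return h(a)

# illustrative sizes: D = 3 inputs, M = 4 units
x = np.array([0.5, -1.0, 2.0])
W = 0.1 * np.ones((4, 3))        # W[j, i] plays the role of w_ji^(1)
b = np.zeros(4)                  # b[j] plays the role of the bias w_j0^(1)
print(layer_forward(x, W, b))
```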



Activation Functions

• Can use a variety of activation functions
  • Sigmoidal (S-shaped)
    • Logistic sigmoid 1/(1 + exp(-a)) (useful for binary classification)
    • Hyperbolic tangent tanh
  • Radial basis function z_j = \sum_i (x_i - w_{ji})^2
  • Softmax
    • Useful for multi-class classification
  • Identity
    • Useful for regression
  • Threshold
  • …
• Needs to be differentiable for gradient-based learning (later)
• Can use different activation functions in each unit
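As a rough numpy sketch (assumed implementations, not given on the slides) of a few of these activation functions:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))      # S-shaped, outputs in (0, 1): binary classification

def softmax(a):
    e = np.exp(a - np.max(a))            # subtract the max for numerical stability
    return e / e.sum()                   # outputs sum to 1: multi-class classification

def identity(a):
    return a                             # regression

def threshold(a):
    return (a > 0).astype(float)         # not differentiable at 0, zero gradient elsewhere

a = np.array([-1.0, 0.0, 2.0])
print(logistic(a), np.tanh(a), softmax(a), threshold(a))
```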



Feed-forward Networks

[Figure: two-layer feed-forward network. Inputs x_0, x_1, \dots, x_D connect to hidden units z_0, z_1, \dots, z_M through weights w^{(1)} (e.g. w_{MD}^{(1)}); hidden units connect to outputs y_1, \dots, y_K through weights w^{(2)} (e.g. w_{KM}^{(2)}, w_{10}^{(2)})]

• Connect together a number of these units into a feed-forward network (DAG)
• Above shows a network with one layer of hidden units
• Implements the function:

  y_k(x, w) = h\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)
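A minimal numpy sketch of this function (illustrative only), assuming tanh hidden units and identity output units, with the biases w_{j0}^{(1)}, w_{k0}^{(2)} kept as separate vectors rather than fixed x_0 = z_0 = 1 units:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh):
    # y_k = out( sum_j W2[k, j] * h( sum_i W1[j, i] * x_i + b1[j] ) + b2[k] ),
    # with h = tanh for hidden units and out = identity here
    z = h(W1 @ x + b1)       # hidden unit outputs z_j
    return W2 @ z + b2       # network outputs y_k

# illustrative sizes: D = 3 inputs, M = 4 hidden units, K = 2 outputs
rng = np.random.default_rng(0)
x = np.array([0.2, -0.7, 1.5])
W1, b1 = 0.1 * rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = 0.1 * rng.standard_normal((2, 4)), np.zeros(2)
print(forward(x, W1, b1, W2, b2))
```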




Network Training

• Given a specified network structure, how do we set its parameters (weights)?
• As usual, we define a criterion to measure how well our network performs, and optimize against it
• For regression, training data are (x_n, t_n), t_n \in \mathbb{R}
  • Squared error naturally arises:

    E(w) = \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2

• For binary classification, this is another discriminative model; maximum likelihood gives:

    p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} \{ 1 - y_n \}^{1 - t_n}

    E(w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
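A numpy sketch of these two error functions (illustrative, with y taken as the network outputs over the training set):

```python
import numpy as np

def squared_error(y, t):
    # E(w) = sum_n ( y(x_n, w) - t_n )^2, for regression
    return np.sum((y - t) ** 2)

def cross_entropy(y, t, eps=1e-12):
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], for binary targets t_n in {0, 1}
    y = np.clip(y, eps, 1.0 - eps)       # keep predictions away from 0/1 to avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

t = np.array([1.0, 0.0, 1.0])
y = np.array([0.9, 0.2, 0.7])
print(squared_error(y, t), cross_entropy(y, t))
```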




Parameter Optimization

[Figure: error surface E(w) over weight space (w_1, w_2), showing a local minimum w_A, the global minimum w_B, and the gradient \nabla E at a point w_C]

• For either of these problems, the error function E(w) is nasty
  • Nasty = non-convex
  • Non-convex = has local minima



Descent Methods

• The typical strategy for optimization problems of this sort is a descent method:

  w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}

• As we've seen before, these come in many flavours
  • Gradient descent \nabla E(w^{(\tau)})
  • Stochastic gradient descent \nabla E_n(w^{(\tau)})
  • Newton-Raphson (second order) \nabla^2 E(w^{(\tau)})
• All of these can be used here; stochastic gradient descent is particularly effective
  • Redundancy in training data, escaping local minima
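A generic stochastic gradient descent sketch (assumptions: grad_En is a user-supplied function returning \nabla E_n for one training case, and data is a list of (x_n, t_n) pairs; neither is defined on the slides):

```python
import numpy as np

def sgd(w, grad_En, data, eta=0.01, epochs=10, seed=0):
    # w^(tau+1) = w^(tau) + Delta w^(tau), with Delta w = -eta * grad E_n(w) for one example n
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(len(data)):   # visit the training cases in random order
            x_n, t_n = data[n]
            w = w - eta * grad_En(w, x_n, t_n)
    return w
```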




Computing Gradients

• The function y(x_n, w) implemented by a network is complicated
• It isn't obvious how to compute error function derivatives with respect to weights
• Numerical method for calculating error derivatives: use (central) finite differences:

  \frac{\partial E_n}{\partial w_{ji}} \approx \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon}

• How much computation would this take with W weights in the network?
  • O(W) per derivative, O(W^2) total per gradient descent step
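A direct numpy sketch of this finite-difference estimate (illustrative); each of the W components needs its own pair of error evaluations, and each evaluation of E is itself O(W), which is where the O(W^2) cost per gradient comes from:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    # dE/dw_i ~ ( E(w + eps*e_i) - E(w - eps*e_i) ) / (2*eps), central differences
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2.0 * eps)
    return grad

# example: E(w) = ||w||^2 has exact gradient 2w
w = np.array([1.0, -2.0, 0.5])
print(numerical_gradient(lambda v: np.sum(v ** 2), w))   # approximately [2, -4, 1]
```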





Error Backpropagation

• Backprop is an efficient method for computing error derivatives \partial E_n / \partial w_{ji}
  • O(W) to compute derivatives wrt all weights
• First, feed training example x_n forward through the network, storing all activations a_j
• Calculating derivatives for weights connected to output nodes is easy
  • e.g. for linear output nodes y_k = \sum_i w_{ki} z_i:

    \frac{\partial E_n}{\partial w_{ki}} = \frac{\partial}{\partial w_{ki}} \frac{1}{2} (y_n - t_n)^2 = (y_n - t_n) \, z_{ni}

• For hidden layers, propagate error backwards from the output nodes




Chain Rule for Partial Derivatives

• A "reminder"
• For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:

  \frac{\partial f}{\partial u} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial u} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial u}

  and

  \frac{\partial f}{\partial v} = \frac{\partial f}{\partial x} \frac{\partial x}{\partial v} + \frac{\partial f}{\partial y} \frac{\partial y}{\partial v}



Error Backpropagation

• We can write

  \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} E_n(a_{j_1}, a_{j_2}, \dots, a_{j_m})

  where \{j_i\} are the indices of the nodes in the same layer as node j

• Using the chain rule:

  \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} + \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial w_{ji}}

  where \sum_k runs over all other nodes k in the same layer as node j

• Since a_k does not depend on w_{ji}, all terms in the summation go to 0:

  \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}




Error Backpropagation cont.

• Introduce the error \delta_j \equiv \frac{\partial E_n}{\partial a_j}

  \frac{\partial E_n}{\partial w_{ji}} = \delta_j \, \frac{\partial a_j}{\partial w_{ji}}

• The other factor is:

  \frac{\partial a_j}{\partial w_{ji}} = \frac{\partial}{\partial w_{ji}} \sum_k w_{jk} z_k = z_i



Error Backpropagation cont.

• The error \delta_j can also be computed using the chain rule:

  \delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \underbrace{\frac{\partial E_n}{\partial a_k}}_{\delta_k} \frac{\partial a_k}{\partial a_j}

  where \sum_k runs over all nodes k in the layer after node j

• Eventually:

  \delta_j = h'(a_j) \sum_k w_{kj} \delta_k

• A weighted sum of the later error "caused" by this weight
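Putting the pieces together, a minimal numpy sketch (illustrative, not the slides' own code) of one backpropagation pass for the two-layer network from earlier, assuming tanh hidden units, linear outputs, and the per-example squared error E_n = 0.5 ||y - t||^2:

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    # forward pass, storing the activations
    a1 = W1 @ x + b1                                   # hidden activations a_j
    z = np.tanh(a1)                                    # hidden outputs z_j, h = tanh
    y = W2 @ z + b2                                    # linear output units

    # backward pass: deltas
    delta_out = y - t                                  # delta_k = dE_n/da_k for linear outputs
    delta_hid = (1.0 - z ** 2) * (W2.T @ delta_out)    # delta_j = h'(a_j) * sum_k w_kj delta_k

    # gradients: dE_n/dw_ji = delta_j * z_i (z_i = x_i for the first layer), all in O(W)
    return (np.outer(delta_hid, x), delta_hid,         # grad W1, grad b1
            np.outer(delta_out, z), delta_out)         # grad W2, grad b2

# gradients for one training case (illustrative sizes D = 3, M = 4, K = 2)
rng = np.random.default_rng(1)
x, t = rng.standard_normal(3), rng.standard_normal(2)
gW1, gb1, gW2, gb2 = backprop(x, t, rng.standard_normal((4, 3)), np.zeros(4),
                              rng.standard_normal((2, 4)), np.zeros(2))
```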





Applications of Neural Networks

• Many success stories for neural networks
  • Credit card fraud detection
  • Hand-written digit recognition
  • Face detection
  • Autonomous driving (CMU ALVINN)



Hand-written Digit Recognition

• MNIST - standard dataset for hand-written digit recognition
• 60,000 training images, 10,000 test images



LeNet-5

[Figure: LeNet-5 architecture. INPUT 32x32 → C1: 6 feature maps 28x28 (convolutions) → S2: 6 maps 14x14 (subsampling) → C3: 16 maps 10x10 (convolutions) → S4: 16 maps 5x5 (subsampling) → C5: 120 units → F6: 84 units (full connection) → OUTPUT: 10 (Gaussian connections)]

• LeNet developed by Yann LeCun et al.
• Convolutional neural network
  • Local receptive fields (5x5 connectivity)
  • Subsampling (2x2)
  • Shared weights (reuse same 5x5 "filter")
  • Breaking symmetry
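For illustration only, a rough PyTorch sketch of a LeNet-5-style network with the layer sizes listed above; the original used scaled tanh units, trainable subsampling layers, and RBF ("Gaussian connection") outputs, which are simplified here to average pooling and a plain linear output layer:

```python
import torch.nn as nn

lenet5_sketch = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 feature maps, 28x28 (shared 5x5 filters)
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S2: 6 maps, 14x14 (2x2 subsampling)
    nn.Conv2d(6, 16, kernel_size=5),   # C3: 16 maps, 10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S4: 16 maps, 5x5
    nn.Conv2d(16, 120, kernel_size=5), # C5: 120 units (5x5 filters reduce maps to 1x1)
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(120, 84),                # F6: 84 units, fully connected
    nn.Tanh(),
    nn.Linear(84, 10),                 # OUTPUT: 10 digit classes
)
```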



[Figure: the misclassified MNIST test digits, each labelled true digit → predicted digit]

• The 82 errors made by LeNet-5 (0.82% test error rate)



Deep Learning

• Collection of important techniques to improve performance:
  • Multi-layer networks (as in LeNet-5), auto-encoders for unsupervised feature learning, sparsity, convolutional networks, parameter tying, probabilistic models for initializing parameters, hinge activation functions for steeper gradients
• ufldl.stanford.edu/eccv10-tutorial
• http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf
• Scalability is key; can use lots of data, since stochastic gradient descent is memory-efficient



Conclusion

• Readings: Ch. 5.1, 5.2, 5.3
• Feed-forward networks can be used for regression or classification
  • Similar to linear models, except with adaptive non-linear basis functions
  • These allow us to do more than e.g. linear decision boundaries
  • Different error functions
• Learning is more difficult, error function is not convex
  • Use stochastic gradient descent, obtain a (good?) local minimum
  • Backpropagation for efficient gradient computation
