
Feedforward Neural Nets and Backpropagation

Julie Nutini

University of British Columbia

MLRG

September 28th, 2016


Supervised Learning Roadmap

Supervised Learning: Assume that we are given the features $x_i$. Could also use basis functions or kernels.

Unsupervised Learning: Learn a representation $z_i$ based on the features $x_i$. Also used for supervised learning: use $z_i$ as features.

Supervised Learning: Learn features $z_i$ that are good for supervised learning.


Supervised Learning Roadmap

[Figure: four diagrams building up the roadmap. (1) Linear model: prediction $\hat{y}_i$ computed directly from the features $x_{i1}, \ldots, x_{id}$ with weights $w_1, \ldots, w_d$. (2) Change of basis: $\hat{y}_i$ computed from basis features $z_{i1}, \ldots, z_{ik}$ with weights $w_1, \ldots, w_k$. (3) Basis from a latent-factor model: $z_{i1}, \ldots, z_{ik}$ computed from $x_{i1}, \ldots, x_{id}$ via a weight matrix $W$ (entries $W_{11}, \ldots, W_{kd}$), with $w$ and $W$ trained separately. (4) Simultaneously learn the features $z_i$ for the task and the regression model.]

→ These are all examples of Feedforward Neural Networks.


Feed-forward Neural Network

Information always moves in one direction: no loops, never goes backwards. Forms a directed acyclic graph.

Each node receives input only from the immediately preceding layer.

Simplest type of artificial neural network.


Neural Networks - Historical Background

1943: McCulloch and Pitts proposed the first computational model of a neuron

1949: Hebb proposed the first learning rule

1958: Rosenblatt's work on perceptrons

1969: Minsky and Papert's book exposed limits of the theory

1970s: Decade of dormancy for neural networks

1980-90s: Neural networks return (self-organization, back-propagation algorithms, etc.)


Model of Single Neuron

McCulloch and Pitts (1943): “integrate and fire” model (no hidden layers).

[Figure: a single neuron with inputs $x_{i1}, x_{i2}, \ldots, x_{id}$, weights $w_1, w_2, \ldots, w_d$, and output $\hat{y}_i$.]

Denote the $d$ input values for sample $i$ by $x_{i1}, x_{i2}, \ldots, x_{id}$.

Each of the $d$ inputs has a weight $w_1, w_2, \ldots, w_d$.

Compute the prediction as a weighted sum,
$$\hat{y}_i = w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id} = \sum_{j=1}^{d} w_j x_{ij}.$$

Use $\hat{y}_i$ in some loss function: $\frac{1}{2}(\hat{y}_i - y_i)^2$.
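As a concrete illustration, here is a minimal NumPy sketch of this single-neuron model and its squared loss (the variable names and toy numbers are my own, not from the slides):

```python
import numpy as np

def predict(w, x):
    """Single neuron: the prediction is the weighted sum w_1*x_1 + ... + w_d*x_d."""
    return np.dot(w, x)

def squared_loss(y_hat, y):
    """The loss 1/2 (y_hat - y)^2 from the slide."""
    return 0.5 * (y_hat - y) ** 2

w = np.array([0.2, -0.5, 1.0])    # weights w_1, ..., w_d (d = 3)
x_i = np.array([1.0, 2.0, 0.5])   # features x_{i1}, ..., x_{id}
y_i = 0.3                         # target value

y_hat = predict(w, x_i)
print(y_hat, squared_loss(y_hat, y_i))
```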


Incorporating Hidden Layers

Consider more than one neuron:

[Figure: fully connected network with inputs $x_{i1}, \ldots, x_{id}$, hidden units $z_{i1}, \ldots, z_{ik}$ (weights $W_{11}, \ldots, W_{kd}$), and output $\hat{y}_i$ (weights $w_1, \ldots, w_k$).]

Input to the hidden layer: a function $h$ of the features from the latent-factor model,
$$z_i = h(Wx_i).$$

Each neuron has a directed connection to ALL neurons of the subsequent layer.

This function $h$ is often called the activation function. Each unit/node applies an activation function.
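A minimal sketch of this forward pass, with the activation $h$ left as a parameter (the names and numbers are mine):

```python
import numpy as np

def forward(x, W, w, h):
    """One-hidden-layer forward pass: z_i = h(W x_i), then prediction w^T z_i."""
    z = h(W @ x)   # hidden representation (k values)
    return w @ z   # scalar prediction

W = np.array([[0.1, -0.2],    # k x d weight matrix (k = 2 hidden units, d = 2 inputs)
              [0.3,  0.4]])
w = np.array([1.0, -1.0])     # hidden-to-output weights
x = np.array([2.0, 1.0])      # one example x_i

# Identity activation here; the non-linear choices are discussed on the next slides.
print(forward(x, W, w, h=lambda v: v))
```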


Linear Activation Function

A linear activation function has the form
$$h(Wx_i) = a + Wx_i,$$
where $a$ is called the bias (intercept).

Example: linear regression with a linear basis (linear-linear model).
  Representation: $z_i = h(Wx_i)$ (from the latent-factor model)
  Prediction: $\hat{y}_i = w^T z_i$
  Loss: $\frac{1}{2}(\hat{y}_i - y_i)^2$

To train this model, we solve:
$$\mathop{\mathrm{argmin}}_{w \in \mathbb{R}^k,\; W \in \mathbb{R}^{k \times d}} \;\underbrace{\frac{1}{2}\sum_{i=1}^{n} \big(y_i - w^T z_i\big)^2}_{\text{linear regression with } z_i \text{ as features}} \;=\; \frac{1}{2}\sum_{i=1}^{n} \big(y_i - w^T h(Wx_i)\big)^2$$

But this is just a linear model:
$$w^T(Wx_i) = (W^T w)^T x_i = \tilde{w}^T x_i, \qquad \text{with } \tilde{w} = W^T w.$$
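A quick numerical check of this collapse (a sketch with arbitrary shapes and values of my choosing, and with the bias $a$ omitted as in the last line above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
W = rng.standard_normal((k, d))   # input-to-hidden weights
w = rng.standard_normal(k)        # hidden-to-output weights
x = rng.standard_normal(d)        # one feature vector x_i

two_layer = w @ (W @ x)           # linear activation: w^T (W x_i)
one_layer = (W.T @ w) @ x         # the same prediction as a single linear model
print(np.allclose(two_layer, one_layer))  # True
```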


Binary Activation Function

To increase flexibility, something needs to be non-linear.

A Heaviside step function has the form

$$h(v) = \begin{cases} 1 & \text{if } v \ge a \\ 0 & \text{otherwise} \end{cases}$$

where $a$ is the threshold. Example: let $a = 0$.

This yields a binary $z_i = h(Wx_i)$. $Wx_i$ has a concept encoded by each of its $2^k$ possible sign patterns.
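A short sketch of this binary activation (threshold $a = 0$; the other values are mine):

```python
import numpy as np

def heaviside(v, a=0.0):
    """Step activation: 1 where v >= a, 0 otherwise, applied element-wise."""
    return (v >= a).astype(float)

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))   # k = 3 hidden units, d = 4 inputs
x = rng.standard_normal(4)
z = heaviside(W @ x)              # binary code z_i in {0, 1}^k
print(z)                          # one of the 2^3 = 8 possible sign patterns of W x_i
```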


Perceptrons

Rosenblatt (late 1950s).

Algorithm for supervised learning of binary classifiers. Decides whether an input belongs to a specific class or not.

Uses a binary activation function. Can learn the “AND”, “OR”, and “NOT” functions.

Perceptrons are only capable of learning linearly separable patterns.


Learning Algorithm for Perceptrons

The perceptron learning rule is given as follows:

1. For each $x_i$ and desired output $y_i$ in the training set,
   1. Calculate the output: $\hat{y}_i^{(t)} = h\big((w^{(t)})^T x_i\big)$.
   2. Update the weights: $w_j^{(t+1)} = w_j^{(t)} + \big(y_i - \hat{y}_i^{(t)}\big)\, x_{ij}$ for all $1 \le j \le d$.
2. Continue these steps until $\frac{1}{2}\sum_{i=1}^{n} \big|y_i - \hat{y}_i^{(t)}\big| < \gamma$ (a threshold).

If the training set is linearly separable, then the algorithm is guaranteed to converge (Rosenblatt, 1962).

An upper bound exists on the number of times the weights are adjusted during training.

Can also use gradient descent if the function is differentiable.
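A minimal NumPy implementation of this rule (the toy data, which learns the OR function with a constant feature acting as a bias, is my own choice):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron rule: w_j <- w_j + (y_i - yhat_i) * x_ij, looping over the examples."""
    _, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if w @ x_i >= 0 else 0.0   # step activation with threshold 0
            if y_hat != y_i:
                w += (y_i - y_hat) * x_i           # perceptron update
                errors += 1
        if errors == 0:                            # no mistakes in a full pass: converged
            break
    return w

# Linearly separable toy data: the OR function, with a constant 1 as a bias feature.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w = perceptron_train(X, y)
print([1.0 if w @ x >= 0 else 0.0 for x in X])  # matches y
```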


Non-Linear Continuous Activation Function

What about a smooth approximation to the step function?

A sigmoid function has the form
$$h(v) = \frac{1}{1 + e^{-v}}.$$

Applying the sigmoid function element-wise,
$$z_{ic} = \frac{1}{1 + e^{-W_c^T x_i}}.$$

This is called a multi-layer perceptron or neural network.
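In code, the sigmoid-activated hidden layer looks like this (a sketch with made-up weights):

```python
import numpy as np

def sigmoid(v):
    """Smooth approximation to the step function: 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

W = np.array([[0.5, -1.0],   # k x d hidden weights (rows W_c)
              [2.0,  0.0]])
x = np.array([1.0, 3.0])     # one example x_i
z = sigmoid(W @ x)           # z_ic = 1 / (1 + exp(-W_c^T x_i)), element-wise
print(z)                     # values in (0, 1) rather than {0, 1}
```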


Why Neural Network?

A “typical” neuron.

Neuron has many dendrites. Each dendrite “takes” input.

Neuron has a single axon. The axon “sends” output.

With the “right” input to the dendrites: an action potential travels along the axon.


Why Neural Network?

First neuron:
  Each dendrite: $x_{i1}, x_{i2}, \ldots, x_{id}$
  Nucleus: computes $W_c^T x_i$
  Axon: sends the signal $\frac{1}{1 + e^{-W_c^T x_i}}$
  Axon terminal: neurotransmitter, synapse

Second neuron:
  Each dendrite receives: $z_{i1}, z_{i2}, \ldots, z_{ik}$
  Nucleus: computes $w^T z_i$
  Axon: sends the signal $\hat{y}_i$

Describes a neural network with one hidden layer (two neurons).

The human brain has approximately $10^{11}$ neurons.

Each neuron is connected to approximately $10^4$ neurons.

. . .


Artificial Neural Nets vs. Real Neural Nets

Artificial neural network:
  $x_i$ is a measurement of the world.
  $z_i$ is an internal representation of the world.
  $\hat{y}_i$ is the output of a neuron for classification/regression.

[Figure: the one-hidden-layer network with inputs $x_i$, hidden units $z_i$, and output $\hat{y}_i$.]

Real neural networks are more complicated:
  Timing of action potentials seems to be important.
  Rate coding: the frequency of action potentials simulates a continuous output.
  Neural networks don't reflect the sparsity of action potentials.
  How much computation is done inside a neuron?
  The brain is highly organized (e.g., substructures and cortical columns).
  The connection structure changes.
  Different types of neurotransmitters.


The Universal Approximation Theorem

One of the first versions was proven by George Cybenko (1989).

For feedforward neural networks, the universal approximation theorem “... claims that every continuous function defined on a compact set can be arbitrarily well approximated by a neural network with one hidden layer”.

Thus, a simple neural network is capable of representing a wide variety of functions when given appropriate parameters.

“Algorithmic learnability” of those parameters?


Artificial Neural Networks

With squared loss, our objective function is:

$$\mathop{\mathrm{argmin}}_{w \in \mathbb{R}^k,\; W \in \mathbb{R}^{k \times d}} \;\frac{1}{2}\sum_{i=1}^{n} \big(y_i - w^T h(Wx_i)\big)^2$$

Usual training procedure: stochastic gradient. Compute the gradient for a random example $i$, then update $w$ and $W$.

Computing the gradient is known as backpropagation.

Adding regularization to w and/or W is known as weight decay.
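For concreteness, one stochastic-gradient step with $\ell_2$ regularization (weight decay) could be written as follows; the step size $\alpha$ and regularization strength $\lambda$ are my notation, not from the slides:

$$w \leftarrow w - \alpha\big(\nabla_w f_i(w,W) + \lambda\, w\big), \qquad W \leftarrow W - \alpha\big(\nabla_W f_i(w,W) + \lambda\, W\big),$$

where $f_i(w,W) = \frac{1}{2}\big(y_i - w^T h(Wx_i)\big)^2$ is the loss on the randomly chosen example $i$.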


Backpropagation

Output values $\hat{y}_i$ are compared to the correct values $y_i$ to compute the error.

The error is fed back through the network (backpropagation of errors).

The weights are adjusted to reduce the value of the error function by a small amount.

After a number of cycles, the network converges to a state where the error is small.

To adjust the weights, use the non-linear optimization method gradient descent.
→ Backpropagation can only be applied to differentiable activation functions.


Backpropagation Algorithm

for each weight w_{i,j} in the network do
    w_{i,j} ← a small random number
repeat
    for each example (x, y) do
        /* Propagate the inputs forward to compute the outputs */
        for each node i in the input layer do
            a_i ← x_i
        for ℓ = 2 to L do
            for each node j in layer ℓ do
                v_j ← Σ_i w_{i,j} a_i
                a_j ← h(v_j)
        /* Propagate the deltas backwards from the output layer to the input layer */
        for each node j in the output layer do
            ∆[j] ← h'(v_j) (y_j − a_j)
        for ℓ = L − 1 to 1 do
            for each node i in layer ℓ do
                ∆[i] ← h'(v_i) Σ_j w_{i,j} ∆[j]
        /* Update every weight in the network using the deltas */
        for each weight w_{i,j} in the network do
            w_{i,j} ← w_{i,j} + α a_i ∆[j]
until some stopping criterion is satisfied
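A runnable sketch of this procedure for a single hidden layer with a sigmoid activation and one linear output node (the step size, network size, and the XOR toy data are my own choices, not from the slides):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_train(X, y, k=5, alpha=0.2, epochs=5000, seed=0):
    """Backpropagation for one hidden layer: forward pass, deltas, weight updates."""
    rng = np.random.default_rng(seed)
    _, d = X.shape
    W = 0.1 * rng.standard_normal((k, d))   # input-to-hidden weights (small random numbers)
    w = 0.1 * rng.standard_normal(k)        # hidden-to-output weights
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):          # one update per example
            # Forward pass.
            z = sigmoid(W @ x_i)            # hidden activations a_j = h(v_j)
            y_hat = w @ z                   # linear output node
            # Backward pass (deltas).
            delta_out = y_i - y_hat                    # output delta (h' = 1 for a linear output)
            delta_hid = z * (1 - z) * (w * delta_out)  # h'(v_i) * sum_j w_{i,j} Delta[j]
            # Weight updates: w_{i,j} <- w_{i,j} + alpha * a_i * Delta[j].
            w += alpha * delta_out * z
            W += alpha * np.outer(delta_hid, x_i)
    return w, W

# XOR pattern (not linearly separable); the constant 1 acts as a bias feature.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
w, W = backprop_train(X, y)
print([round(float(w @ sigmoid(W @ x)), 2) for x in X])  # targets are [0, 1, 1, 0]
```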


Backpropagation

Derivations of the back-propagation equations are very similar to the gradient calculation for logistic regression.

Uses the multivariable chain rule more than once.

Consider the loss for a single example,
$$f(w,W) = \frac{1}{2}\Big(y_i - \sum_{c=1}^{k} w_c\, h(W_c x_i)\Big)^2,$$
where $W_c$ denotes row $c$ of $W$.

Derivative with respect to $w_j$ (the weights connecting the hidden layer to the output layer):
$$\frac{\partial f(w,W)}{\partial w_j} = -\Big(y_i - \sum_{c=1}^{k} w_c\, h(W_c x_i)\Big) \cdot h(W_j x_i)$$

Derivative with respect to $W_{jm}$ (the weights connecting the input layer to the hidden layer):
$$\frac{\partial f(w,W)}{\partial W_{jm}} = -\Big(y_i - \sum_{c=1}^{k} w_c\, h(W_c x_i)\Big) \cdot w_j\, h'(W_j x_i)\, x_{im}$$
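These formulas translate directly into code; the following sketch (sigmoid activation, random values of my choosing) also checks one entry of each gradient against a finite-difference approximation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def f(w, W, x, y):
    """Loss 1/2 (y - sum_c w_c h(W_c x))^2 with a sigmoid activation h."""
    return 0.5 * (y - w @ sigmoid(W @ x)) ** 2

def gradients(w, W, x, y):
    """The two derivative formulas above; note the residual r shared by both."""
    z = sigmoid(W @ x)                           # h(W_j x) for every hidden unit j
    r = y - w @ z                                # common factor in both derivatives
    grad_w = -r * z                              # df/dw_j   = -r * h(W_j x)
    grad_W = -r * np.outer(w * z * (1 - z), x)   # df/dW_jm  = -r * w_j h'(W_j x) x_m
    return grad_w, grad_W

rng = np.random.default_rng(0)
k, d = 3, 4
w, W = rng.standard_normal(k), rng.standard_normal((k, d))
x, y = rng.standard_normal(d), 0.7
grad_w, grad_W = gradients(w, W, x, y)

eps = 1e-6
w_pert = w.copy(); w_pert[0] += eps
W_pert = W.copy(); W_pert[1, 2] += eps
print((f(w_pert, W, x, y) - f(w, W, x, y)) / eps, grad_w[0])     # should roughly agree
print((f(w, W_pert, x, y) - f(w, W, x, y)) / eps, grad_W[1, 2])  # should roughly agree
```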


Backpropagation

Notice repeated calculations in gradients:

$$\frac{\partial f(w,W)}{\partial w_j} = -\Big(y_i - \sum_{c=1}^{k} w_c\, h(W_c x_i)\Big) \cdot h(W_j x_i)$$
$$\frac{\partial f(w,W)}{\partial W_{jm}} = -\Big(y_i - \sum_{c=1}^{k} w_c\, h(W_c x_i)\Big) \cdot w_j\, h'(W_j x_i)\, x_{im}$$

The same value $-\big(y_i - \sum_{c=1}^{k} w_c\, h(W_c x_i)\big)$ appears in $\frac{\partial f(w,W)}{\partial w_j}$ for all $j$ and in $\frac{\partial f(w,W)}{\partial W_{jm}}$ for all $j$ and $m$.

The same value $w_j\, h'(W_j x_i)$ appears in $\frac{\partial f(w,W)}{\partial W_{jm}}$ for all $m$.


Disadvantages of Neural Networks

Cannot be used if very little data is available.

Many free parameters (e.g., number of hidden nodes, learning rate, minimal error, etc.)

Not great for precision calculations.

Do not provide explanations (unlike slopes in linear models that represent correlations).

The network can overfit the training data and fail to capture the true statistical process behind the data.

Early stopping can be used to avoid this.

Speed of convergence.

Local minima.


Advantages of Neural Networks

Easy to maintain.

Versatile.

Used on problems for which analytical methods do not yet exist.

Models non-linear dependencies.

Quickly works out patterns, even with noisy data.

Always outputs an answer, even when the input is incomplete.
