Neural Networks: Multilayer Perceptron


Page 1: Neural Networks: Multilayer Perceptron

CHAPTER 04

MULTILAYER PERCEPTRONS

CSC445: Neural Networks

Prof. Dr. Mostafa Gadal-Haqq M. Mostafa

Computer Science Department

Faculty of Computer & Information Sciences

AIN SHAMS UNIVERSITY

(Most of the figures in this presentation are copyrighted by Pearson Education, Inc.)

Page 2: Neural Networks: Multilayer Perceptron


Introduction

Limitation of Rosenblatt’s Perceptron

Batch Learning and On-line Learning

The Back-propagation Algorithm

Heuristics for Making the BP Alg. Perform Better

Computer Experiment


Page 3: Neural Networks: Multilayer Perceptron


Introduction

Limitation of Rosenblatt’s Perceptron

AND operation:

The truth table for d = x1 AND x2 (columns x1, x2, d):

x1  x2  d
0   0   0
0   1   0
1   0   0
1   1   1

A single neuron with inputs x1 and x2, a bias input +1, weights w1, w2, w0, and output y must satisfy the inequalities:

w0 < 0
w2 + w0 < 0
w1 + w0 < 0
w1 + w2 + w0 >= 0

It is easy to find a set of weights that satisfies the above inequalities, for example:

y = f(10 x1 + 10 x2 - 20), with f(z) = 1 / (1 + e^(-z))

The decision boundary is linear.
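A quick numerical check of this solution, as a minimal Python sketch (the weights are the ones above; note that at x1 = x2 = 1 the local field is exactly 0, so y = 0.5 sits right on the decision boundary):

```python
import math

def f(z):
    # logistic activation f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def and_gate(x1, x2):
    # weights from the example: w1 = w2 = 10, bias weight w0 = -20
    return f(10 * x1 + 10 * x2 - 20)

for x1 in (0, 1):
    for x2 in (0, 1):
        y = and_gate(x1, x2)
        print(x1, x2, int(y >= 0.5))   # thresholding y at 0.5 recovers AND
```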

Page 4: Neural Networks: Multilayer Perceptron


Introduction

Limitation of Rosenblatt’s Perceptron

OR Operation:

The truth table for d = x1 OR x2 (columns x1, x2, d):

x1  x2  d
0   0   0
0   1   1
1   0   1
1   1   1

The same single neuron (inputs x1, x2, bias input +1, weights w1, w2, w0, output y) must now satisfy:

w0 < 0
w2 + w0 >= 0
w1 + w0 >= 0
w1 + w2 + w0 >= 0

It is easy to find a set of weights that satisfies the above inequalities, for example:

y = f(20 x1 + 20 x2 - 10), with f(z) = 1 / (1 + e^(-z))

Again, the decision boundary is linear.

Page 5: Neural Networks: Multilayer Perceptron


Introduction

Limitation of Rosenblatt’s Perceptron

XOR Operation:

The truth table for d = x1 XOR x2 (columns x1, x2, d):

x1  x2  d
0   0   0
0   1   1
1   0   1
1   1   0

The single neuron (inputs x1, x2, bias input +1, weights w1, w2, w0, output y = f(???)) would have to satisfy:

w0 < 0
w2 + w0 >= 0
w1 + w0 >= 0
w1 + w2 + w0 < 0

Clearly the second and third inequalities are incompatible with the fourth: adding them gives w1 + w2 + 2 w0 >= 0, and since w0 < 0 by the first inequality, this forces w1 + w2 + w0 >= -w0 > 0. So there is no solution for the XOR problem; the required decision boundary is non-linear. We need more complex networks!

Page 6: Neural Networks: Multilayer Perceptron


The XOR Problem

A two-layer Network to solve the XOR Problem

Figure 4.8 (a) Architectural graph of network for solving the XOR problem. (b) Signal-flow graph of the network.

Hidden neuron 1: w11 = w12 = +1, bias b1 = -3/2
Hidden neuron 2: w21 = w22 = +1, bias b2 = -1/2
Output neuron 3: w31 = -2, w32 = +1, bias b3 = -1/2
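With threshold (McCulloch-Pitts) units, hidden neuron 1 fires only for x1 AND x2, hidden neuron 2 fires for x1 OR x2, and the output neuron combines them into XOR. A minimal sketch using the weights above (the threshold unit is an assumption consistent with Figure 4.8):

```python
def step(v):
    # McCulloch-Pitts threshold unit: fires when the local field is non-negative
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 1.5)            # hidden neuron 1: x1 AND x2
    h2 = step(x1 + x2 - 0.5)            # hidden neuron 2: x1 OR x2
    return step(-2 * h1 + h2 - 0.5)     # output: OR but not AND, i.e. XOR

for x in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(x, xor_net(*x))               # prints 0, 1, 1, 0
```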

Page 7: Neural Networks: Multilayer Perceptron


The XOR Problem

A two-layer Network to solve the XOR Problem

Figure 4.9 (a) Decision boundary constructed by hidden neuron 1 of the network in Fig. 4.8. (b) Decision boundary constructed by hidden neuron 2 of the network. (c) Decision boundaries constructed by the complete network.


Page 8: Neural Networks: Multilayer Perceptron


MLP: Some Preliminaries

The multilayer perceptron (MLP) was proposed to overcome the limitations of the perceptron; that is, to build a network that can solve nonlinear problems.

The basic features of the multilayer perceptrons:

Each neuron in the network includes a nonlinear activation function that is differentiable.

The network contains one or more layers that are hidden from both the input and output nodes.

The network exhibits a high degree of connectivity.

Page 9: Neural Networks: Multilayer Perceptron


MLP: Some Preliminaries

Architecture of a multilayer perceptron

Figure 4.1 Architectural graph of a multilayer perceptron with two hidden layers.


Page 10: Neural Networks: Multilayer Perceptron


MLP: Some Preliminaries

Weight Dimensions


If the network has n units in layer i and m units in layer i+1, then the weight matrix Wij connecting the two layers has dimension m x (n+1); the extra column holds the bias weights.
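A quick shape check in Python (a sketch; numpy and the layer sizes are illustrative):

```python
import numpy as np

n, m = 4, 3                    # units in layer i and in layer i+1
W = np.zeros((m, n + 1))       # weight matrix: one extra column for the biases

y_i = np.random.randn(n)       # outputs of layer i
v = W @ np.append(y_i, 1.0)    # append the fixed bias input +1
print(W.shape, v.shape)        # (3, 5) (3,): one local field per unit in layer i+1
```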

Page 11: Neural Networks: Multilayer Perceptron


MLP: Some Preliminaries

Number of neurons in the output layer

Use one output neuron per class; the desired response is a one-hot code. For four classes:

Class        Desired output
Pedestrian   1 0 0 0
Car          0 1 0 0
Motorcycle   0 0 1 0
Truck        0 0 0 1
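The same encoding in a short sketch (class names as above):

```python
import numpy as np

classes = ["Pedestrian", "Car", "Motorcycle", "Truck"]

def one_hot(label):
    # desired response: 1 at the class index, 0 elsewhere
    d = np.zeros(len(classes))
    d[classes.index(label)] = 1.0
    return d

print(one_hot("Motorcycle"))   # [0. 0. 1. 0.]
```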

Page 12: Neural Networks: Multilayer Perceptron


MLP: Some Preliminaries

Training of the multilayer perceptron proceeds in two phases:

In the forward phase, the weights of the network are fixed and the input signal is propagated through the network, layer by layer, until it reaches the output.

In the backward phase, the error signal, which is produced by comparing the output of the network and the desired response, is propagated through the network, again layer by layer, but in the backward direction.
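The forward phase in a compact sketch (numpy assumed; `weights` is a list of per-layer matrices, each taking a bias-augmented input as in the shape check earlier; the backward phase is developed over the next pages):

```python
import numpy as np

def forward(x, weights, phi=np.tanh):
    # forward phase: weights fixed; propagate the signal layer by layer
    y = x
    for W in weights:
        y = phi(W @ np.append(y, 1.0))  # bias input +1, then the activation
    return y                            # network output
```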

Page 13: Neural Networks: Multilayer Perceptron


MLP: Some Preliminaries

Function signal: the input signal that comes in at the input end of the network, propagates forward (neuron by neuron) through the network, and emerges at the output end as an output signal.

Error signal: originates at the output neurons of the network and propagates backward (layer by layer) through the network.

Each hidden or output neuron computes both kinds of signal.

Figure 4.2 Illustration of the directions of two basic signal flows in a multilayer perceptron: forward propagation of function signals and back propagation of error signals.


Page 14: Neural Networks: Multilayer Perceptron


MLP: Some Preliminaries

Function of the Hidden neurons

The hidden neurons play a critical role in the operation of a multilayer perceptron; they act as feature detectors.

This nonlinearity transforms the input data into a feature space in which the data may be separated more easily.

Credit Assignment Problem

The problem of assigning credit or blame for overall outcomes to the internal decisions made by the computational units of a distributed learning system.

The error-correction learning algorithm is easy to use for training single-layer perceptrons, but not for multilayer perceptrons; the back-propagation algorithm solves this problem.


Page 15: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

An on-line learning algorithm.

Figure 4.3 Signal-flow graph highlighting the details of output neuron j.

For output neuron j:

vj(n) = Sum (i = 0 to m) wji(n) yi(n)

yj(n) = phi_j(vj(n))

ej(n) = dj(n) - yj(n)

Page 16: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

The weights are updated in a manner similar to the LMS and gradient-descent methods. That is, the instantaneous error and the weight correction are:

E(n) = (1/2) ej^2(n)   and   Delta wji(n) = -eta * dE(n)/dwji(n)

Using the chain rule of calculus, we get:

dE(n)/dwji(n) = [dE(n)/dej(n)] [dej(n)/dyj(n)] [dyj(n)/dvj(n)] [dvj(n)/dwji(n)]

We have:

dE(n)/dej(n) = ej(n),   dej(n)/dyj(n) = -1,
dyj(n)/dvj(n) = phi_j'(vj(n)),   dvj(n)/dwji(n) = yi(n)

Page 17: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

which yields:

dE(n)/dwji(n) = -ej(n) phi_j'(vj(n)) yi(n)

Then the weight correction is given by the delta rule:

Delta wji(n) = eta * delta_j(n) * yi(n)

where the local gradient delta_j(n) is defined by:

delta_j(n) = -dE(n)/dvj(n) = ej(n) phi_j'(vj(n))

Page 18: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

That is, the local gradient of neuron j is equal to the product of the corresponding error signal of that neuron and the derivative of the associated activation function. Then, we have two distinct cases:

Case 1: Neuron j is an output node:

In this case, it is easy to apply the credit-assignment rule to compute the error signal ej(n), because the desired response is directly visible to the output neuron. That is, ej(n) = dj(n) - yj(n).

Case 2: Neuron j is a hidden node:

In this case, the desired response is not visible to the hidden neuron. Accordingly, the error signal for the hidden neuron has to be determined recursively, working backwards, in terms of the error signals of all the neurons to which that hidden neuron is directly connected.


Page 19: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

Case 2: Neuron j is a hidden node.

Figure 4.4 Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j.


Page 20: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

We redefine the local gradient for a hidden neuron j as:

delta_j(n) = -[dE(n)/dyj(n)] [dyj(n)/dvj(n)] = -[dE(n)/dyj(n)] phi_j'(vj(n))

where the total instantaneous error over the set C of output neurons k is:

E(n) = (1/2) Sum (k in C) ek^2(n)

Differentiating with respect to yj(n) yields:

dE(n)/dyj(n) = Sum_k ek(n) dek(n)/dyj(n) = Sum_k ek(n) [dek(n)/dvk(n)] [dvk(n)/dyj(n)]

But

ek(n) = dk(n) - yk(n) = dk(n) - phi_k(vk(n))

Hence

dek(n)/dvk(n) = -phi_k'(vk(n))

Page 21: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

Also, we have

vk(n) = Sum (j = 0 to m) wkj(n) yj(n)

Differentiating yields:

dvk(n)/dyj(n) = wkj(n)

Then, we get

dE(n)/dyj(n) = -Sum_k ek(n) phi_k'(vk(n)) wkj(n) = -Sum_k delta_k(n) wkj(n)

Finally, the back-propagation formula for the local gradient of (hidden) neuron j, with k ranging over the output neurons, is:

delta_j(n) = phi_j'(vj(n)) Sum_k delta_k(n) wkj(n)

Page 22: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

Figure 4.5 Signal-flow graph of a part of the adjoint system pertaining to back-propagation of error signals.


Page 23: Neural Networks: Multilayer Perceptron


The Back-propagation Algorithm

We summarize the relations for the back-propagation algorithm:

First: the correction Delta wji(n) applied to the weight connecting neuron i to neuron j is defined by the delta rule:

[weight correction Delta wji(n)] = [learning-rate parameter eta] x [local gradient delta_j(n)] x [input signal yi(n) of neuron j]

Second: the local gradient delta_j(n) depends on whether neuron j is an output node or a hidden node:

Neuron j is an output node:

delta_j(n) = ej(n) phi_j'(vj(n)),   with ej(n) = dj(n) - yj(n)

Neuron j is a hidden node (neuron k, in the layer to its right, is output or hidden):

delta_j(n) = phi_j'(vj(n)) Sum_k delta_k(n) wkj(n)

A complete sketch of these update rules follows.
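A minimal numpy sketch of on-line back-propagation for a one-hidden-layer MLP, putting the relations above together. The layer sizes, learning rate, tanh activation, and XOR task are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, eta = 2, 4, 1, 0.5

# one weight matrix per layer, each with a trailing bias column: shape (m, n+1)
W1 = rng.normal(0.0, 0.5, (n_hid, n_in + 1))
W2 = rng.normal(0.0, 0.5, (n_out, n_hid + 1))

phi = np.tanh
dphi = lambda v: 1.0 - np.tanh(v) ** 2            # phi'(v)

def train_step(x, d):
    global W1, W2
    # forward phase: weights fixed, signal propagated layer by layer
    y0 = np.append(x, 1.0)
    v1 = W1 @ y0; y1 = np.append(phi(v1), 1.0)
    v2 = W2 @ y1; y2 = phi(v2)
    # backward phase: error signal propagated layer by layer, backwards
    e = d - y2                                    # ej(n) = dj(n) - yj(n)
    delta2 = e * dphi(v2)                         # output node: ej * phi'(vj)
    delta1 = dphi(v1) * (W2[:, :-1].T @ delta2)   # hidden: phi' * Sum_k delta_k w_kj
    # delta rule: Delta w = eta * delta_j * y_i
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, y0)

# on-line training on XOR, examples shuffled each epoch
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[-1], [1], [1], [-1]], dtype=float)  # targets inside tanh's range
for epoch in range(2000):
    for i in rng.permutation(len(X)):
        train_step(X[i], D[i])
```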

Page 24: Neural Networks: Multilayer Perceptron


The Activation Function

Differentiability is the only requirement that an activation function has to satisfy in the BP algorithm.

It is needed to compute the derivative phi'(v), and hence the local gradient delta, for each neuron.

Sigmoidal functions are commonly used, since they satisfy this condition:

Logistic function:

phi(v) = 1 / (1 + exp(-a v)),   a > 0

phi'(v) = a exp(-a v) / [1 + exp(-a v)]^2 = a phi(v) [1 - phi(v)]

Hyperbolic tangent function:

phi(v) = a tanh(b v),   a, b > 0

phi'(v) = (b/a) [a - phi(v)] [a + phi(v)]
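Both pairs in code, checked against numerical differentiation (a sketch; the values of a and b are arbitrary):

```python
import numpy as np

a, b = 1.0, 0.5

def logistic(v):   return 1.0 / (1.0 + np.exp(-a * v))
def dlogistic(v):  return a * logistic(v) * (1.0 - logistic(v))

def tanh_act(v):   return a * np.tanh(b * v)
def dtanh_act(v):  return (b / a) * (a - tanh_act(v)) * (a + tanh_act(v))

v, h = 0.7, 1e-6
print(dlogistic(v), (logistic(v + h) - logistic(v - h)) / (2 * h))   # should agree
print(dtanh_act(v), (tanh_act(v + h) - tanh_act(v - h)) / (2 * h))   # should agree
```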

Page 25: Neural Networks: Multilayer Perceptron


The Rate of Learning

A simple method of increasing the rate of learning while avoiding instability (for a large learning rate eta) is to modify the delta rule by including a momentum term:

Delta wji(n) = alpha * Delta wji(n - 1) + eta * delta_j(n) * yi(n)

where alpha is usually a positive number called the momentum constant. To ensure convergence, the momentum constant must be restricted to 0 <= |alpha| < 1.

Figure 4.6 Signal-flow graph illustrating the effect of momentum constant alpha, which lies inside the feedback loop.
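The same update as a sketch (the argument shapes are illustrative; the previous correction dW_prev must be carried between calls):

```python
import numpy as np

eta, alpha = 0.1, 0.9      # learning rate and momentum constant, 0 <= alpha < 1

def momentum_update(W, dW_prev, delta, y_in):
    # Delta w(n) = alpha * Delta w(n-1) + eta * delta_j(n) * y_i(n)
    dW = alpha * dW_prev + eta * np.outer(delta, y_in)
    return W + dW, dW      # return dW so the caller can pass it in next time
```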

Page 26: Neural Networks: Multilayer Perceptron


Summary of the Back-propagation Algorithm

1. Initialization
2. Presentation of training examples
3. Forward computation
4. Backward computation
5. Iteration

Figure 4.7 Signal-flow graphical summary of back-propagation learning. Top part of the graph: forward pass. Bottom part of the graph: backward pass.


Page 27: Neural Networks: Multilayer Perceptron


Heuristics for making the BP Better

1. Stochastic vs. Batch update

Stochastic (sequential) mode is computationally faster than the batch mode.

2. Maximizing information content

Use an example that results in large training error

Use an example that is radically different from the others.

3. Activation function

Use an odd function: the hyperbolic tangent, phi(v) = a tanh(b v), rather than the logistic function.

Page 28: Neural Networks: Multilayer Perceptron


Heuristics for making the BP Better

4. Target values

It is very important to choose the values of the desired response to be within the range of the sigmoid function.

5. Normalizing the input

Each input should be preprocessed so that its mean value, averaged over the entire training sample, is close to zero, or else it will be small compared to its standard deviation.

Figure 4.11 Illustrating the operation of mean removal, decorrelation, and covariance equalization for a two-dimensional input space.
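The mean-removal step, plus per-feature variance scaling, in a sketch (decorrelation, e.g. via PCA, is omitted):

```python
import numpy as np

def normalize_inputs(X):
    # X: (N examples) x (features); make each feature zero-mean, unit-variance
    mu = X.mean(axis=0)        # mean removal
    sigma = X.std(axis=0)      # per-feature variance equalization
    return (X - mu) / sigma
```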

Page 29: Neural Networks: Multilayer Perceptron


Heuristics for making the BP Better

6. Initialization

A good choice of initial weights can be of tremendous help.

Initialize the weights so that the standard deviation of the induced local field v of a neuron lies in the transition area between the linear and saturated parts of its sigmoid function.

7. Learning from hints

This is achieved by allowing for prior information that we may have about the mapping function, e.g., symmetry, invariances, etc.

8. Learning rate

All neurons in the multilayer perceptron should ideally learn at the same rate; accordingly, the learning rate should be assigned a smaller value in the last layers than in the front layers.


Page 30: Neural Networks: Multilayer Perceptron


Batch Learning and On-line Learning

Consider the training sample used to train the network in a supervised manner:

T = {x(n), d(n); n = 1, 2, ..., N}

If yj(n) is the function signal produced at output neuron j, the error signal produced at the same neuron is:

ej(n) = dj(n) - yj(n)

The instantaneous error produced at output neuron j is:

Ej(n) = (1/2) ej^2(n)

The total instantaneous error of the whole network (C is the set of output neurons) is:

E(n) = Sum (j in C) Ej(n) = (1/2) Sum (j in C) ej^2(n)

and the total instantaneous error averaged over the training sample is:

Eav(N) = (1/N) Sum (n = 1 to N) E(n) = (1/2N) Sum (n = 1 to N) Sum (j in C) ej^2(n)

Page 31: Neural Networks: Multilayer Perceptron


Batch Learning and On-line Learning

Batch Learning:

Adjustment of the weights of the MLP is performed after the presentation of all N training examples in T; this is called an epoch of training.

Thus, weight adjustment is made on an epoch-by-epoch basis.

After each epoch, the examples in the training sample T are randomly shuffled.

Advantages:

Accurate estimation of the gradient vector (the derivatives of the cost function Eav with respect to the weight vector w), which guarantees convergence of the method of steepest descent to a local minimum.

Parallelization of the learning process.

Disadvantage: it is demanding in terms of storage requirements.


Page 32: Neural Networks: Multilayer Perceptron


Batch Learning and On-line Learning

On-line Learning:

Adjustment of the weights of the MLP is performed on an example-by-example basis.

The cost function to be minimized is therefore the total instantaneous error E(n).

An epoch of training is the presentation of all N examples to the network; again, in each epoch the examples are randomly shuffled.

Advantages:

Its stochastic nature makes the learning process less likely to be trapped in a local minimum.

It is much less demanding in terms of storage requirements.

Disadvantage: we cannot parallelize the learning process. The two schedules are contrasted in the sketch below and in the table on the next slide.
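The two update schedules side by side, as a sketch (`grad` is an assumed callable returning the gradient of the instantaneous error for one example):

```python
import numpy as np

def online_epoch(w, X, D, grad, eta, rng=np.random.default_rng()):
    # on-line: update after every example, presented in shuffled order
    for i in rng.permutation(len(X)):
        w = w - eta * grad(w, X[i], D[i])
    return w

def batch_epoch(w, X, D, grad, eta):
    # batch: average the gradient over all N examples, then update once
    g = sum(grad(w, x, d) for x, d in zip(X, D)) / len(X)
    return w - eta * g
```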


Page 33: Neural Networks: Multilayer Perceptron


Batch Learning and On-line Learning

Batch learning vs. On-line Learning:

On-line learning: the learning process is performed in a stochastic manner.
Batch learning: the learning process is performed by ensemble averaging, which in a statistical context may be viewed as a form of statistical inference.

On-line learning: less likely to be trapped in a local minimum.
Batch learning: guarantees convergence to a local minimum.

On-line learning: cannot be parallelized.
Batch learning: can be parallelized.

On-line learning: requires much less storage.
Batch learning: requires large storage.

On-line learning: well suited for pattern-classification problems.
Batch learning: well suited for nonlinear regression problems.

Page 34: Neural Networks: Multilayer Perceptron


Generalization

A network is said to generalize well when its input-output mapping is correct (or nearly so) for test data.

We may view the learning process as "curve fitting": when the network is trained too closely on the training samples, it may become overfitted, or overtrained, which leads to poor generalization.

Sufficient training-sample size

Generalization is influenced by three factors:

The size of the training sample

The network architecture

The physical complexity of the problem at hand

In practice, good generalization is achieved if the training-sample size N satisfies:

N = O(W / epsilon)

where W is the number of free parameters in the network and epsilon is the fraction of classification errors permitted on test data. For example, with W = 1,000 free parameters and epsilon = 0.1 (10% test error permitted), this suggests on the order of N = 10,000 training examples.

Figure 4.16 (a) Properly fitted nonlinear mapping with good generalization. (b) Overfitted nonlinear mapping with poor generalization.

Page 35: Neural Networks: Multilayer Perceptron


Cross-Validation Method

Cross-validation is a standard tool in statistics that provides an appealing guiding principle:

First: the available data set is randomly partitioned into a training set and a test set.

Second: the training set is further partitioned into two disjoint subsets:

An estimation subset, used to select the model (estimate its parameters).

A validation subset, used to test or validate the model.

The validation subset is used to assess the various models and choose the "best" one.

However, this best model may be overfitting the validation data.

To guard against this possibility, generalization performance is measured on the test set, which is different from the validation subset.


Page 36: Neural Networks: Multilayer Perceptron


Cross-Validation Method

Early-Stopping Method (Holdout Method)

The training is stopped periodically, i.e., after every so many epochs, and the network is assessed on the validation subset.

When the validation phase is complete, the estimation (training) is resumed for another period, and the process is repeated.

The best model (parameters) is the one at the minimum of the validation error. (A sketch follows the figure caption below.)

Figure 4.17 Illustration of the early-stopping rule based on cross-validation.
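A sketch of the early-stopping loop; the `model`, `train_one_period`, and `validation_error` callables are assumed interfaces, not from the slides:

```python
def early_stopping(model, train_one_period, validation_error, max_periods=100):
    # train_one_period: runs a fixed number of epochs on the estimation subset
    # validation_error: mean-square error of the model on the validation subset
    best_err = float("inf")
    best_params = model.get_params()
    for _ in range(max_periods):
        train_one_period(model)           # resume estimation for one period
        err = validation_error(model)     # stop periodically and validate
        if err < best_err:                # keep the parameters at the minimum
            best_err, best_params = err, model.get_params()
    return best_params, best_err
```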


Page 37: Neural Networks: Multilayer Perceptron


Cross-Validation Method

Variant of Cross-Validation (Multifold Method)

Divide the data set of N examples into K subsets, where K > 1.

In each trial, the network is trained on K - 1 of the subsets and validated on the remaining one, using a different validation subset in each trial.

The performance of the model is assessed by averaging the squared error under validation over all trials. (A sketch follows the figure caption below.)

Figure 4.18 Illustration of the multifold method of cross-validation. For a given trial, the subset of data shaded in red is used to validate the model trained on the remaining data.
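The multifold procedure as a sketch (`train` and `squared_error` are assumed callables; X and D are numpy arrays):

```python
import numpy as np

def multifold_cv(X, D, K, train, squared_error, seed=0):
    # split the N examples into K subsets; validate on each in turn
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)
    errors = []
    for k in range(K):
        val = folds[k]                                   # held-out subset
        trn = np.concatenate([f for i, f in enumerate(folds) if i != k])
        model = train(X[trn], D[trn])                    # train on the rest
        errors.append(squared_error(model, X[val], D[val]))
    return float(np.mean(errors))                        # average over all trials
```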


Page 38: Neural Networks: Multilayer Perceptron


Computer Experiment

Figure 4.12 Results of the computer experiment on the back-propagation algorithm applied to the MLP with distance d = -4. MSE stands for mean-square error.


Page 39: Neural Networks: Multilayer Perceptron


Computer Experiment

Figure 4.13 Results of the computer experiment on the back-propagation algorithm applied to the MLP with distance d = -5.


Page 40: Neural Networks: Multilayer Perceptron


Real Experiment

Handwritten Digit Recognition*

*Courtesy of Yann LeCun.


Page 41: Neural Networks: Multilayer Perceptron

Homework 4

Problems: 4.1, 4.3

Computer experiment: 4.15

Page 42: Neural Networks: Multilayer Perceptron

Next Time: Kernel Methods and RBF Networks