
In The name of God

Lecture 4: Perceptron and ADALINE

Dr. Majid Ghoshuni


Introduction

• The LMS algorithm is built around a linear neuron (a neuron with a linear activation function).

• However, Rosenblatt's Perceptron (1958) is built around a nonlinear neuron, namely the McCulloch-Pitts model of a neuron.

• This neuron has a hard-limiting activation function (it performs the signum function).

• Recently the term multilayer Perceptron has often been used as a synonym for the term multilayer feedforward neural network. In this section we will be referring to the former meaning.


Perceptron (1)

• Goal
– Classify the applied input into one of two classes.

• Procedure
– If the output of the hard limiter is +1, assign the input to class C1;
– if it is -1, assign it to class C2.

• The input of the hard limiter is the weighted sum of the inputs.
– The effect of the bias b is merely to shift the decision boundary away from the origin.
– The synaptic weights are adapted on an iteration-by-iteration basis.
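In standard notation (the figure and equations on this slide are images in the source and are not reproduced here), the classification rule reads:

$$ v = \mathbf{w}^{T}\mathbf{x} + b, \qquad y = \operatorname{sgn}(v), \qquad y = +1 \Rightarrow \mathbf{x} \in C_1, \quad y = -1 \Rightarrow \mathbf{x} \in C_2 $$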


Perceptron (2)

• The decision regions are separated by a hyperplane.
– A point (x1, x2) above the boundary line is assigned to class C1.
– A point (y1, y2) below the boundary line is assigned to class C2.


Perceptron Learning Theorem (1)

• Linearly separable
– If two classes are linearly separable, there exists a decision surface consisting of a hyperplane.
– If so, there exists a weight vector w such that
  wᵀx > 0 for every input vector x belonging to class C1
  wᵀx ≤ 0 for every input vector x belonging to class C2

• The perceptron works well only for linearly separable classes.


Perceptron Learning Theorem (2)

• Using the modified signal-flow graph
– The bias b(n) is treated as a synaptic weight driven by the fixed input +1.
– w0(n) is b(n).
– The linear combiner output is then v(n) = wᵀ(n) x(n).


Perceptron Learning Theorem (3)

• Weight adjustment
– If x(n) is correctly classified, no correction is made.
– Otherwise, the weight vector is corrected (the standard update equations are written out below).
– The learning-rate parameter η(n) controls the adjustment applied to the weight vector.
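The update equations on this slide are images in the source; the standard fixed-increment form of the rule, consistent with the text above, is:

$$
\mathbf{w}(n+1)=
\begin{cases}
\mathbf{w}(n), & \text{if } \mathbf{x}(n)\text{ is correctly classified},\\[2pt]
\mathbf{w}(n)-\eta(n)\,\mathbf{x}(n), & \text{if } \mathbf{w}^{T}(n)\mathbf{x}(n)>0 \text{ and } \mathbf{x}(n)\in C_2,\\[2pt]
\mathbf{w}(n)+\eta(n)\,\mathbf{x}(n), & \text{if } \mathbf{w}^{T}(n)\mathbf{x}(n)\le 0 \text{ and } \mathbf{x}(n)\in C_1.
\end{cases}
$$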


Summary of Learning

1. Initialization
– Set w(0) = 0.
2. Activation
– At time step n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n).
3. Computation of actual response
– y(n) = sgn[wᵀ(n) x(n)]
4. Adaptation of weight vector
– w(n+1) = w(n) + η[d(n) − y(n)] x(n)
5. Continuation
– Increment time step n and go back to step 2.

A MATLAB sketch of this loop is given below.
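A minimal sketch of steps 1-5 on a small toy data set (my own illustration, not the course M-file; the data and η are assumptions):

% Perceptron training loop on two linearly separable 2-D clusters.
X = [1 2 1.5 6 7 6.5;          % inputs, one column per pattern
     1 1.5 2 6 7 6.5];
X = [X; ones(1, 6)];           % fixed +1 bias input (w0 plays the role of b)
d = [-1 -1 -1 1 1 1];          % desired responses in {-1, +1}
eta = 0.1;                     % learning-rate parameter
w = zeros(3, 1);               % step 1: initialization, w(0) = 0
for epoch = 1:50
    for n = 1:size(X, 2)       % step 2: apply x(n) and d(n)
        y = sign(w' * X(:, n));            % step 3: actual response
        if y == 0, y = 1; end              % treat sgn(0) as +1
        w = w + eta * (d(n) - y) * X(:, n);% step 4: adaptation of the weight vector
    end
end                            % step 5: continuation until the loop ends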


The network is capable of solving linearly separable problems.


Learning rule

• An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron.

• Let t be the time; at t = 0 we have:


Learning rule

• An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron.

• Let t be the time; at t = 1 we have:


Learning rule

• An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron.

• Let t be the time; at t = 2 we have:


Learning rule

• An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron.

• Let t be the time; at t = 3 we have:


Implementation of Logical NOT, AND, and OR


Implementation of Logical Gate


Finding Weights by the MSE Method

• Write an equation for each training datum.

• The output for the first class is +1 and for the second class is -1 (or 0).

• Apply the MSE method to solve the problem.

• Example: implementation of the AND gate. Writing one equation per training pattern, with a bias input fixed at +1, gives

$$ \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = D, $$

where D holds the targets (+1 for the input (1,1), -1 for the other three). Solving this overdetermined system in the least-squares (MSE) sense gives w1 = w2 = 1 and b = -1.5, i.e. the decision line x1 + x2 - 1.5 = 0 (a MATLAB sketch of this computation follows below).
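A minimal MATLAB sketch of this least-squares computation (my own illustration, assuming the ±1 targets reconstructed above):

% MSE (least-squares) weights for the AND gate.
X = [0 0 1;                 % each row: [x1 x2 bias]
     0 1 1;
     1 0 1;
     1 1 1];
D = [-1; -1; -1; 1];        % desired outputs (+1 only for input (1,1))
w = pinv(X) * D;            % least-squares solution [w1; w2; b] = [1; 1; -1.5]
y = sign(X * w)             % reproduces the AND truth table in {-1,+1}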

Summary: Perceptron vs. MSE procedures

• Perceptron rule
– The perceptron rule always finds a solution if the classes are linearly separable,
– but it does not converge if the classes are not separable.

• MSE criterion
– The MSE solution has guaranteed convergence, but it may not find a separating hyperplane even if the classes are linearly separable.

• Notice that MSE tries to minimize the sum of the squares of the distances of the training data to the separating hyperplane.


Convergence of the Perceptron learning law

• Rosenblatt proved that if the input patterns are linearly separable, then the Perceptron learning law converges, and the hyperplane separating the two classes of input patterns can be determined.

• Fixed-increment convergence theorem: for linearly separable classes X1 and X2, the perceptron converges after some n0 iterations.


Limitation of the Perceptron

• The XOR problem (Minsky): nonlinear separability.


Perceptron with sigmoid activation function

• For a single neuron with a step activation function:

• For a single neuron with a sigmoid activation function:

(see the standard forms of the two update rules below)
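The two update equations referred to above are images in the source; their standard forms, matching the gradient derivation later in this lecture, are:

$$ \text{step (hard limiter):}\qquad \Delta w_i = \eta\,(d - y)\,x_i $$

$$ \text{sigmoid } y = \varphi(v),\; v = \textstyle\sum_i w_i x_i:\qquad \Delta w_i = \eta\,(d - y)\,\varphi'(v)\,x_i $$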


Representation of the Perceptron in MATLAB


MATLAB TOOLBOX

• net = newp(P,T,TF,LF)
• Description of the function
– Perceptrons are used to solve simple (i.e. linearly separable) classification problems.

• NET = NEWP(P,T,TF,LF) takes these inputs:
– P : R-by-Q matrix of Q input vectors of R elements each.
– T : S-by-Q matrix of Q target vectors of S elements each.
– TF: Transfer function, default = 'hardlim'.
– LF: Learning function, default = 'learnp'.

and returns a new perceptron.
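A small usage sketch of newp on a linearly separable problem (my own example, using the older Neural Network Toolbox interface described above):

% Train a perceptron to implement the OR gate.
P = [0 0 1 1;
     0 1 0 1];                       % input vectors (one per column)
T = [0 1 1 1];                        % targets in {0,1} for 'hardlim'
net = newp(P, T, 'hardlim', 'learnp');
net = train(net, P, T);               % iterative perceptron-rule training
Y = sim(net, P)                       % reproduces T after training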


Classification example: linear separability

• See the M-file.


Classification of data: nonlinear separability


ADALINE: The Adaptive Linear Element

• ADALINE is a Perceptron with a linear activation function.

• It was proposed by Widrow.


Applications of Adaline

• In general, the Adaline is used to perform
– Linear approximation of a "small" segment of a nonlinear hypersurface generated by a p-variable function y = f(x). In this case the bias is usually needed.
– Linear filtering and prediction of data (signals).
– Pattern association, that is, generation of m-element output vectors associated with respective p-element input vectors.


Error concept

• For a single neuron:
  ε = d − y

• For multiple neurons:
  εi = di − yi,  i = 1:m
  εm×1 = dm×1 − ym×1
– m is the number of output neurons.

• The total measure of the goodness of approximation, or the performance index, can be specified by the mean-squared error over the m neurons and N training vectors:

$$ J(W) = \frac{1}{mN} \sum_{i=1}^{N} \sum_{j=1}^{m} e_{ji}^{2}(W) $$

where W is the m×p weight matrix (one row per output neuron):

$$ W_{m \times p} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1p} \\ w_{21} & w_{22} & \cdots & w_{2p} \\ \vdots & & & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mp} \end{bmatrix} $$

• The MSE solution is:

$$ W_{p\times m} = \big(X_{p\times N}\,X^{t}_{N\times p}\big)^{-1}\, X_{p\times N}\, D^{t}_{N\times m} $$

• The error equation is:

$$ J(W) = \frac{1}{N}\, E_{m\times N}\, E^{t}_{N\times m}, \qquad E_{m\times N} = D_{m\times N} - W^{t}_{m\times p}\, X_{p\times N} $$

• For a single neuron (m = 1):

$$ J(W) = \frac{1}{N}\, E_{1\times N}\, E^{t}_{N\times 1} $$

• Replacing the error in this equation:

$$ J(W) = \frac{1}{N}\,(D - W^{t}X)(D - W^{t}X)^{t} = \frac{1}{N}\big[\, D D^{t} - D X^{t} W - W^{t} X D^{t} + W^{t} X X^{t} W \,\big] $$

$$ J(W) = \frac{1}{N}\big[\, D D^{t} - 2\, D X^{t} W + W^{t} X X^{t} W \,\big] $$

Example 1:

$$ X = \begin{bmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}, \quad D = \begin{bmatrix} 1.1 & 1.8 & 3.2 & 4.1 & 4.8 & 5.7 & 7.3 & 7.9 & 9.2 & 9.9 \end{bmatrix}, \quad W = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} $$

$$ J(w_1, w_2) = \frac{1}{10}\Big[\, 385.38 \;-\; 2\begin{bmatrix} 330 & 55 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \;+\; \begin{bmatrix} w_1 & w_2 \end{bmatrix}\begin{bmatrix} 285 & 45 \\ 45 & 10 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \Big] $$

$$ J(w_1, w_2) = \frac{1}{10}\big[\, 385.38 - 660\,w_1 - 110\,w_2 + 285\,w_1^2 + 90\,w_1 w_2 + 10\,w_2^2 \,\big] $$
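A short MATLAB check of Example 1 (my own sketch; it rebuilds the quantities that appear in J(w1, w2) above):

% Example 1: linear fit y = w1*x + w2 via the MSE performance index.
X = [0:9; ones(1, 10)];                          % 2-by-10: [x; +1 bias]
D = [1.1 1.8 3.2 4.1 4.8 5.7 7.3 7.9 9.2 9.9];   % desired outputs
N = 10;
c0 = D * D';                % 385.38
c1 = D * X';                % [330 55]
C2 = X * X';                % [285 45; 45 10]
W  = C2 \ c1';              % minimizer of J: W = [1; 1]
Jmin = (c0 - 2*c1*W + W'*C2*W) / N    % minimum value, 0.038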

The plot of the performance index J(w1, w2) of Example 1


Example 2: the performance index in the general case


Method of steepest descent

• If N is large, the order of calculation will be high.
• In order to avoid this problem, we can find the optimal weight vector, for which the mean-squared error J(w) attains its minimum, by iterative modification of the weight vector, for each training exemplar, in the direction opposite to the gradient of the performance index J(w).
• An example is illustrated in Figure 4-5 for a single-weight situation.


Illustration of the steepest descent method


Method of steepest descent

• When the weight vector attains the optimal value, for which the gradient is zero (w0 in Figure 4-5), the iterations are stopped.

• More precisely, the iterations are specified as shown below,

• where the weight adjustment Δw(n) is proportional to the gradient of the mean-squared error,

• and where η is a learning gain.
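Written out (the equations on this slide are images in the source), the standard steepest-descent iteration is:

$$ w(n+1) = w(n) + \Delta w(n), \qquad \Delta w(n) = -\,\eta\,\nabla J\big(w(n)\big) $$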


The LMS (Widrow-Hoff) Learning Law

• The Least-Mean-Square learning law replaces the (batch) gradient of the mean-squared error with an instantaneous, per-sample update and can be written in the following form:

$$ \Delta W(n) = \eta\, x(n)\, \varepsilon^{t}(n), \qquad \varepsilon_i(n) = d_i(n) - y_i(n), \quad i = 1{:}m $$

$$ W(n+1) = W(n) + \eta\, x(n)\, \varepsilon^{t}(n) $$

The LMS (Widrow-Hoff) Learning Law

• For a single neuron:

For a linear neuron:

$$ y = \sum_i w_i x_i, \qquad J = \tfrac{1}{2}(d - y)^2 = \tfrac{1}{2}\Big(d - \sum_i w_i x_i\Big)^2 $$

$$ \frac{\partial J}{\partial w_i} = -\Big(d - \sum_i w_i x_i\Big)\, x_i \quad\Rightarrow\quad \Delta w_i = \eta\,(d - y)\, x_i $$

For a nonlinear neuron:

$$ v = \sum_i w_i x_i, \qquad y = \varphi(v), \qquad J = \tfrac{1}{2}\big(d - \varphi(v)\big)^2 $$

$$ \frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial v}\,\frac{\partial v}{\partial w_i} = -\big(d - \varphi(v)\big)\,\varphi'(v)\, x_i \quad\Rightarrow\quad \Delta w_i = \eta\,(d - y)\,\varphi'(v)\, x_i $$
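A minimal MATLAB sketch of the LMS rule for a single linear neuron (my own illustration; the data, η, and the true weights are assumptions):

% LMS (Widrow-Hoff) training of a single linear neuron, sample by sample.
N = 200; eta = 0.01;
x = [randn(2, N); ones(1, N)];      % inputs with a +1 bias row
wtrue = [1.5; -0.7; 0.3];           % assumed underlying weights
d = wtrue' * x + 0.05*randn(1, N);  % noisy desired response
w = zeros(3, 1);                    % initial weights
for n = 1:N
    y = w' * x(:, n);               % linear neuron output
    e = d(n) - y;                   % instantaneous error
    w = w + eta * e * x(:, n);      % Delta w = eta*(d - y)*x
end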

Network training

• Two types of network training:
– Sequential mode, also called incremental (on-line, stochastic, or per-pattern):
  • Weights are updated after each pattern is presented.
– Batch mode (off-line or per-epoch):
  • Weights are updated after all patterns are presented.


Some general comments on the learning process

• Computationally, the learning process goes through all training examples (an epoch) a number of times, until a stopping criterion is reached.

• The convergence process can be monitored with the plot of the mean-squared error function J(W(n)).

• The popular stopping criteria are:
– the mean-squared error is sufficiently small;
– the rate of change of the mean-squared error is sufficiently small (see the reconstructed forms below).
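The two criteria are given as images in the source; in a standard reconstructed form (the thresholds ε_min and δ are my notation) they read:

$$ J\big(W(n)\big) < \varepsilon_{\min}, \qquad \frac{\big|\,J\big(W(n)\big) - J\big(W(n-1)\big)\,\big|}{J\big(W(n-1)\big)} < \delta $$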


The effect of the learning rate η


Applications (1)

• MA (Moving average) modeling (filtering)

• For M = 2:

$$ y(n) = b_0\, x(n) + b_1\, x(n-1) + b_2\, x(n-2) $$

so the training data for the linear neuron are

$$ X = \begin{bmatrix} x_2 & x_1 & x_0 \\ x_3 & x_2 & x_1 \\ \vdots & \vdots & \vdots \\ x_{N-1} & x_{N-2} & x_{N-3} \end{bmatrix}, \qquad w = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}, \qquad D = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_{N-1} \end{bmatrix} $$
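A small MATLAB sketch of how these X and D matrices can be built for M = 2 (my own illustration; the coefficient values are invented):

% Build the Adaline training data for a 3-tap MA (FIR) model, M = 2.
N = 100;
x = randn(N, 1);                       % input sequence
b = [0.5; 1.0; -0.3];                  % example coefficients [b0; b1; b2]
y = filter(b, 1, x);                   % y(n) = b0*x(n) + b1*x(n-1) + b2*x(n-2)
X = [x(3:N), x(2:N-1), x(1:N-2)];      % rows: [x(n) x(n-1) x(n-2)]
D = y(3:N);                            % matching desired outputs
w = X \ D;                             % least-squares estimate, recovers b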

Applications (2)

• AR (auto regressive) modeling:

• Model of order M:

$$ y(n) = \sum_{i=1}^{M} a_i\, y(n-i) + b\, x(n) $$

• For M = 2:

$$ y(n) = a_1\, y(n-1) + a_2\, y(n-2) + b\, x(n) $$

with training data

$$ X = \begin{bmatrix} y_1 & y_0 & x_2 \\ y_2 & y_1 & x_3 \\ \vdots & \vdots & \vdots \\ y_{N-2} & y_{N-3} & x_{N-1} \end{bmatrix}, \qquad w = \begin{bmatrix} a_1 \\ a_2 \\ b \end{bmatrix}, \qquad D = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_{N-1} \end{bmatrix} $$
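And similarly for the AR case with M = 2 (a sketch under the same assumptions; the parameter values are invented for illustration):

% Build the Adaline training data for an AR(2) model with input x.
N = 200;
x = randn(N, 1);
a = [0.6; -0.2]; b = 0.8;                 % assumed true parameters [a1; a2; b]
y = zeros(N, 1);
for n = 3:N
    y(n) = a(1)*y(n-1) + a(2)*y(n-2) + b*x(n);
end
X = [y(2:N-1), y(1:N-2), x(3:N)];         % rows: [y(n-1) y(n-2) x(n)]
D = y(3:N);
w = X \ D;                                % least squares recovers [a1; a2; b]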

Applications (3)

• PID controller:


Simulation of MA modeling

• Suppose an MA model of order M = 2, i.e. y(n) = b0 x(n) + b1 x(n−1) + b2 x(n−2), with fixed coefficients b0, b1, b2.

• The input is Gaussian noise with mean = 0 and variance = 1.

• y is calculated by the recursive equation of the model.

• Please see the M-file (a sketch of such a script is given below).
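The course M-file itself is not reproduced in this transcript; the following is a minimal sketch of what such a script could look like. The coefficient values are illustrative assumptions; N = 20 and η = 0.1 follow the settings on the next slides.

% Sketch: simulate an MA(2) model and identify its coefficients with LMS.
N = 20; eta = 0.1;
b_true = [1; 2; 3];              % assumed MA coefficients [b0; b1; b2]
x = randn(N, 1);                 % Gaussian input, mean 0, variance 1
y = filter(b_true, 1, x);        % y(n) = b0*x(n) + b1*x(n-1) + b2*x(n-2)
w = zeros(3, 1);                 % zero initial weights
Jhist = zeros(N-2, 1);
for n = 3:N                      % one pass over the training set
    xn = [x(n); x(n-1); x(n-2)];
    e  = y(n) - w' * xn;         % instantaneous error
    w  = w + eta * e * xn;       % LMS update
    Jhist(n-2) = e^2;            % squared error for monitoring convergence
end
plot(Jhist); xlabel('iteration'); ylabel('squared error');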


M-file of MA Modeling


MA Modeling

• Zero initial weights and η = 0.01
• Number of data in the training set: N = 20


MA Modeling

• Zero initial weights and η = 0.1
• Number of data in the training set: N = 20


MA Modeling

• Random initial weights and η = 0.01
• Number of data in the training set: N = 20


MA Modeling

• Random initial weights and η = 0.1
• Number of data in the training set: N = 20


MA Modeling

• Random initial weights and η = 0.1
• Number of data in the training set: N = 10


MATLAB TOOLBOX

• net = newlin(PR,S,ID,LR)
• Description of the function
– Linear layers are often used as adaptive filters for signal processing and prediction.

• NEWLIN(PR,S,ID,LR) takes these arguments:
– PR : R-by-Q matrix of Q representative input vectors.
– S : Number of elements in the output vector.
– ID : Input delay vector, default = [0].
– LR : Learning rate, default = 0.01;
and returns a new linear layer.
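A brief usage sketch of newlin as an adaptive linear layer (my own example, using the older toolbox interface described above; the target function is invented):

% ADALINE via newlin: learn the linear mapping t = 2*p1 - p2 + 0.5.
P = randn(2, 50);                              % representative input vectors
T = 2*P(1,:) - P(2,:) + 0.5;                   % target outputs
net = newlin(P, 1, [0], maxlinlr(P, 'bias'));  % 1 output, no delay, stable learning rate
net.trainParam.epochs = 100;
net = train(net, P, T);                        % batch Widrow-Hoff training
Y = sim(net, P);                               % approximates T after training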
