Lecture 4: Perceptron and ADALINE - mshdiau.ac.ir - ghoshuni.mshdiau.ac.ir/ann/files/chap4.pdf
In The name of God
Lecture 4:
Perceptron and ADALINE
Dr. Majid Ghoshuni
1
Introduction
• Rosenblatt's LMS algorithm for the Perceptron (1958) is built around a linear neuron (a neuron with a linear activation function).
• However, the Perceptron is built around a nonlinear neuron, namely the McCulloch-Pitts model of a neuron.
• This neuron has a hard-limiting activation function (performing the signum function).
• Recently the term multilayer Perceptron has often been used as a synonym for the term multilayer feedforward neural network. In this section we will be referring to the former meaning.
2
Perceptron (1)
• Goal
– classifying an applied input into one of two classes
• Procedure (see the sketch below)
– if the output of the hard limiter is +1, assign to class C1
– if it is -1, assign to class C2
• Input of the hard limiter: the weighted sum of the inputs
– the effect of the bias b is merely to shift the decision boundary away from the origin
– the synaptic weights are adapted on an iteration-by-iteration basis
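As a minimal sketch of this decision procedure (the weight values, bias, and input below are illustrative assumptions, not taken from the slides), the hard limiter can be written in MATLAB as:

% Minimal sketch of the perceptron decision rule (illustrative w, b, x)
w = [1.0; -0.5];        % synaptic weights (assumed values)
b = 0.2;                % bias: shifts the decision boundary away from the origin
x = [0.7; 0.3];         % an input vector to classify
v = w' * x + b;         % hard-limiter input: weighted sum of the inputs plus bias
if v > 0
    y = +1;             % output +1: assign x to class C1
else
    y = -1;             % output -1: assign x to class C2
end
disp(y)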
3
Perceptron (2)
• Decision regions are separated by a hyperplane
– a point (x1, x2) above the boundary line is assigned to class C1
– a point (y1, y2) below the boundary line is assigned to class C2
4
Perceptron Learning Theorem (1)
• Linearly separable
– if two classes are linearly separable, there exists a decision surface consisting of a hyperplane.
– If so, there exists a weight vector w such that
wTx > 0 for every input vector x belonging to class C1
wTx ≤ 0 for every input vector x belonging to class C2
• The perceptron works well only for linearly separable classes
5
Perceptron Learning Theorem (2)
• Using the modified signal-flow graph
– the bias b(n) is treated as a synaptic weight driven by a fixed input equal to +1
– w0(n) is b(n)
– linear combiner output: v(n) = \sum_{i=0}^{p} w_i(n) x_i(n) = wT(n) x(n)
6
Perceptron Learning Theorem (3)
• Weight adjustment (the update equations are written out below)
– if x(n) is correctly classified, the weight vector is left unchanged
– otherwise, the weight vector is updated
– the learning-rate parameter η(n) controls the adjustment applied to the weight vector
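The two update equations are shown only as images on the original slide; in the standard form of the perceptron convergence algorithm (a reconstruction, not a verbatim copy of the slide) they read:

w(n+1) = w(n)                  if x(n) is correctly classified
w(n+1) = w(n) - η(n) x(n)      if wT(n) x(n) > 0 and x(n) belongs to class C2
w(n+1) = w(n) + η(n) x(n)      if wT(n) x(n) ≤ 0 and x(n) belongs to class C1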
7
Summary of Learning
1. Initialization
– set w(0) = 0
2. Activation
– at time step n, activate the perceptron by applying the continuous-valued input vector x(n) and the desired response d(n)
3. Computation of actual response
– y(n) = sgn[wT(n) x(n)]
4. Adaptation of weight vector
– apply the error-correction update (see the rule and the sketch below)
5. Continuation
– increment time step n and go back to step 2
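Step 4 appears on the slide only as an equation image; in the standard error-correction form it is w(n+1) = w(n) + η [d(n) - y(n)] x(n) with d(n) ∈ {+1, -1}. A minimal MATLAB sketch of the whole procedure (the AND-gate data, learning rate, and epoch count are illustrative assumptions):

% Perceptron training sketch for steps 1-5 (illustrative AND-gate data)
X = [0 0 1 1; 0 1 0 1];            % input vectors, one pattern per column
D = [-1 -1 -1 1];                  % bipolar desired responses (AND gate)
X = [X; ones(1, size(X, 2))];      % fixed +1 input so that w0 plays the role of the bias
eta = 0.1;                         % learning-rate parameter
w = zeros(size(X, 1), 1);          % step 1: initialization, w(0) = 0
for epoch = 1:50
    for n = 1:size(X, 2)           % step 2: activation with x(n) and d(n)
        y = sign(w' * X(:, n));    % step 3: actual response y(n) = sgn[wT(n) x(n)]
        if y == 0, y = 1; end      % resolve the boundary case of sign(0)
        w = w + eta * (D(n) - y) * X(:, n);   % step 4: adapt the weight vector
    end                            % step 5: continue with the next time step
end
disp(w')                           % learned [w1 w2 w0]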
8
The network is capable of solving linearly separable problems
9
Learning rule
• An algorithm to update the weights w so that finally the input patterns lie on both sides of the line decided by the perceptron
• Let t be the time; at t = 0, 1, 2 and 3 we have the successive decision lines shown in the figures
10-13
Implementation of Logical NOT, AND, and OR
14
Implementation of Logical Gate
15
Finding Weights by MSE Method
• Write an equation for each training sample
• The output for the first class is +1 and for the second class is -1 (or 0)
• Apply the MSE method to solve the problem
• Example: implementation of the AND gate. With inputs (x1, x2) ∈ {(0,0), (0,1), (1,0), (1,1)} and bipolar targets, one equation per training sample gives
\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \\ -1 \\ 1 \end{bmatrix}
and the MSE solution is w1 = 1, w2 = 1, b = -1.5.
16
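A short MATLAB sketch of this computation (it uses the same data matrix and bipolar targets as the reconstructed example above; pinv returns the least-squares solution):

% MSE (least-squares) weights for the AND gate with bipolar targets
X = [0 0 1; 0 1 1; 1 0 1; 1 1 1];  % one row per training sample: [x1 x2 1]
D = [-1; -1; -1; 1];               % desired outputs of the AND gate
W = pinv(X) * D;                   % least-squares solution [w1; w2; b]
disp(W')                           % gives [1 1 -1.5]
sign(X * W)'                       % reproduces the targets [-1 -1 -1 1]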
Summary: Perceptron vs. MSE procedures
• Perceptron rule
– The perceptron rule always finds a solution if the classes are linearly separable,
– but it does not converge if the classes are not separable.
• MSE criterion
– The MSE solution has guaranteed convergence, but it may not find a separating hyperplane even if the classes are linearly separable.
• Notice that MSE tries to minimize the sum of the squares of the distances of the training data to the separating hyperplane.
17
Convergence of the Perceptron learning law
• Rosenblatt proved that if the input patterns are linearly separable, then the Perceptron learning law converges, and the hyperplane separating the two classes of input patterns can be determined.
• Fixed-increment convergence theorem: for linearly separable vectors X1 and X2, the perceptron converges after some n0 iterations.
18
Limitation of Perceptron
• The XOR problem (Minsky): nonlinear separability
19
Perceptron with sigmoid activation function
• For a single neuron with a step activation function: see the first rule below
• For a single neuron with a sigmoid activation function: see the second rule below
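Both rules appear as images in the original; in their standard form (a reconstruction, with v = wTx the net input and φ the sigmoid) they are:

Δw_i = η (d - y) x_i            (step / hard-limiting activation)
Δw_i = η (d - y) φ'(v) x_i      (sigmoid activation), where φ'(v) = y (1 - y) for the logistic sigmoid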
20
Representation of Perceptron in MATLAB
21
MATLAB TOOLBOX
• net = newp(p,t,tf,lf)
• Description of function
– Perceptrons are used to solve simple (i.e. linearly separable) classification problems.
• NET = NEWP(P,T,TF,LF) takes these inputs:
– P : R-by-Q matrix of Q input vectors of R elements each.
– T : S-by-Q matrix of Q target vectors of S elements each.
– TF: Transfer function, default = 'hardlim'.
– LF: Learning function, default = 'learnp'.
and returns a new perceptron.
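A minimal usage sketch (this assumes the classic Neural Network Toolbox, where NEWP, TRAIN, and SIM are available; the AND-gate data are illustrative):

% Train a perceptron on the AND gate with the classic toolbox functions
P = [0 0 1 1; 0 1 0 1];    % 2-by-4 matrix of input vectors
T = [0 0 0 1];             % 1-by-4 matrix of targets
net = newp(P, T);          % perceptron with defaults 'hardlim' and 'learnp'
net = train(net, P, T);    % adjust weights and bias on the training data
Y = sim(net, P);           % simulate the trained network; Y should match T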
22
Classification of data: nonlinear separability
27
ADALINE: The Adaptive Linear Element
• ADALINE is a Perceptron with a linear activation function
• It was proposed by Widrow
28
Applications of Adaline
• In general, the Adaline is used to perform
– Linear approximation of a "small" segment of a nonlinear hypersurface, which is generated by a p-variable function y = f(x). In this case, the bias is usually needed.
– Linear filtering and prediction of data (signals);
– Pattern association, that is, generation of m-element output vectors associated with respective p-element input vectors.
29
Error concept
• For a single neuron: ε = d - y
• For multiple neurons: ε_i = d_i - y_i, i = 1:m, i.e. ε_{m×1} = d_{m×1} - y_{m×1}
– m is the number of output neurons
• The total measure of the goodness of approximation, or the performance index, can be specified by the mean-squared error over the m neurons and N training vectors:
J(W) = \frac{1}{mN} \sum_{i=1}^{N} \sum_{j=1}^{m} e_{ji}^{2}(W)
30
where W is the m \times p weight matrix
W_{m \times p} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1p} \\ w_{21} & w_{22} & \cdots & w_{2p} \\ \vdots & & & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mp} \end{bmatrix}
31
• The MSE solution is:
W^{T}_{p \times m} = \left( X_{p \times N}\, X^{T}_{N \times p} \right)^{-1} X_{p \times N}\, D^{T}_{N \times m}
• The error equation is:
J(W) = \frac{1}{N}\, E_{m \times N}\, E^{T}_{N \times m}, \qquad E_{m \times N} = D_{m \times N} - W_{m \times p}\, X_{p \times N}
32
• For a single neuron, m = 1:
J(W) = \frac{1}{N}\, E_{1 \times N}\, E^{T}_{N \times 1}
• Replacing the error in the equation:
J(W) = \frac{1}{N} (D - WX)(D - WX)^{T}
= \frac{1}{N} \left[ D D^{T} - D X^{T} W^{T} - W X D^{T} + W X X^{T} W^{T} \right]
= \frac{1}{N} \left[ D D^{T} - 2\, W X D^{T} + W X X^{T} W^{T} \right]
33
Example 1:
X_{2 \times 10} = \begin{bmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}, \qquad D_{1 \times 10} = \begin{bmatrix} 1.1 & 1.8 & 3.2 & 4.1 & 4.8 & 5.7 & 7.3 & 7.9 & 9.2 & 9.9 \end{bmatrix}, \qquad W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}
J(w) = \frac{1}{10} \left[ 385.38 - 2 \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} 330 \\ 55 \end{bmatrix} + \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} 285 & 45 \\ 45 & 10 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} \right]
J(w_1, w_2) = \frac{1}{10} \left[ 385.38 - 660\, w_1 - 110\, w_2 + 285\, w_1^{2} + 90\, w_1 w_2 + 10\, w_2^{2} \right]
34
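A short MATLAB sketch that reproduces this performance index and its minimizer (the data matrix and targets are the ones reconstructed above; pinv gives the MSE weights):

% Example 1: MSE solution and performance index J(w1, w2)
x = 0:9;                                       % first row of the data matrix
X = [x; ones(1, 10)];                          % 2-by-10 data matrix (second row drives the bias weight w2)
D = [1.1 1.8 3.2 4.1 4.8 5.7 7.3 7.9 9.2 9.9]; % desired outputs
W = D * pinv(X);                               % MSE solution [w1 w2]
J = @(w1, w2) mean((D - (w1*x + w2)).^2);      % J(w1, w2): mean-squared error over the 10 samples
disp(W); disp(J(W(1), W(2)));                  % optimal weights and the minimum of J

Evaluating J over a grid of (w1, w2) values with meshgrid and surf would reproduce the surface shown on the next slide.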
The plot of the performance index J(w1, w2) of Example 1
35
Example 2: the performance index in the general case
36
Method of steepest descent
• If N is large, the order of calculation will be high.
• In order to avoid this problem, we find the optimal weight vector, for which the mean-squared error J(w) attains its minimum, by iterative modification of the weight vector for each training exemplar in the direction opposite to the gradient of the performance index J(w).
• An example is illustrated in Figure 4-5 for a single-weight situation.
37
Illustration of the steepest descent method
38
Method of steepest descent
• When the weight vector attains the optimal value for which the gradient is zero (w0 in Figure 4-5), the iterations are stopped.
• More precisely, the iterations are specified as shown below,
• where the weight adjustment Δw(n) is proportional to the gradient of the mean-squared error,
• and where η is a learning gain.
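The iteration referred to above is shown as an equation image on the slide; the standard steepest-descent form (a reconstruction) is:

w(n+1) = w(n) + \Delta w(n), \qquad \Delta w(n) = -\eta\, \nabla J(w(n))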
39
The LMS (Widrow Hoff) Learning LawThe LMS (Widrow‐ Hoff) Learning Law
• The Least- Mean- Square learning law replacesthe gradient of the mean- squared error withg qthe gradient update and can be written infollowing form:following form:
)()()( nnxnW mNt
Npmp
:1
d
miyd iii
)()()()1( WW
ydt
mNmNmN
40
)()()()1( nnxnWnW t
The LMS (Widrow-Hoff) Learning Law
• For a single neuron
For a linear neuron:
y = \sum_i w_i x_i
J = \frac{1}{2} (d - y)^{2} = \frac{1}{2} \left( d - \sum_i w_i x_i \right)^{2}
\frac{\partial J}{\partial w_i} = -(d - y)\, x_i, \qquad \Delta w_i = \eta\, (d - y)\, x_i
For a nonlinear neuron:
v = \sum_i w_i x_i, \qquad y = \varphi(v)
J = \frac{1}{2} (d - y)^{2} = \frac{1}{2} \left( d - \varphi(v) \right)^{2}
\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial v} \frac{\partial v}{\partial w_i} = -(d - y)\, \varphi'(v)\, x_i, \qquad \Delta w_i = \eta\, (d - y)\, \varphi'(v)\, x_i
41
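A minimal MATLAB sketch of this LMS rule for a single linear neuron, trained sample by sample (the data, learning gain, and epoch count are illustrative assumptions):

% LMS (Widrow-Hoff) training of a single linear neuron, sequential mode
X = [0:9; ones(1, 10)];                        % 2-by-10 inputs (second row drives the bias weight)
D = [1.1 1.8 3.2 4.1 4.8 5.7 7.3 7.9 9.2 9.9]; % desired outputs
eta = 0.01;                                    % learning gain
w = zeros(1, 2);                               % initial weights [w1 w2]
for epoch = 1:100
    for n = 1:size(X, 2)
        y   = w * X(:, n);                     % linear neuron output y = sum_i wi*xi
        err = D(n) - y;                        % error eps = d - y
        w   = w + eta * err * X(:, n)';        % w(n+1) = w(n) + eta * eps(n) * x'(n)
    end
end
disp(w)                                        % approaches the MSE solution of Example 1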
Network training
• Two types of network training:
– Sequential mode or incremental (on-line, stochastic, or per-pattern):
• weights are updated after each pattern is presented
– Batch mode (off-line or per-epoch):
• weights are updated after all patterns are presented
42
Some general comments on the learning process
• Computationally, the learning process goes through all training examples (an epoch) a number of times, until a stopping criterion is reached.
• The convergence process can be monitored with the plot of the mean-squared error function J(W(n)).
• The popular stopping criteria are (written out below):
– the mean-squared error is sufficiently small
– the rate of change of the mean-squared error is sufficiently small
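Written out (a reconstruction; ε and δ are small user-chosen thresholds), the two criteria are:

J(W(n)) \le \varepsilon
\left| J(W(n)) - J(W(n-1)) \right| \le \delta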
43
The effect of learning rate: η
44
Applications (1)
• MA (Moving Average) modeling (filtering)
• For M = 2: y(n) = b_0 x(n) + b_1 x(n-1) + b_2 x(n-2), and the training data are arranged as
X = \begin{bmatrix} x_2 & x_1 & x_0 \\ x_3 & x_2 & x_1 \\ \vdots & \vdots & \vdots \\ x_{N-1} & x_{N-2} & x_{N-3} \end{bmatrix}, \qquad W = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \end{bmatrix}, \qquad D = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_{N-1} \end{bmatrix}
45
Applications (2)
• AR (autoregressive) modeling, model of order M:
y(n) = \sum_{i=1}^{M} a_i\, y(n-i) + b\, x(n)
• For M = 2: y(n) = a_1 y(n-1) + a_2 y(n-2) + b x(n), and the training data are arranged as
X = \begin{bmatrix} y_1 & y_0 & x_2 \\ y_2 & y_1 & x_3 \\ \vdots & \vdots & \vdots \\ y_{N-2} & y_{N-3} & x_{N-1} \end{bmatrix}, \qquad W = \begin{bmatrix} a_1 \\ a_2 \\ b \end{bmatrix}, \qquad D = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_{N-1} \end{bmatrix}
46
Applications (3)
• PID controller:
47
Simulation of MA Modeling
• Suppose an MA model of order M = 2 with given coefficients b
• The input is Gaussian noise with mean = 0 and var = 1
• y is calculated by the recursive equation
• Please see the M-file
48
M-file of MA Modeling
49
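The M-file itself appears only as an image in the original; a minimal sketch of what such a script might look like (the MA coefficients b, the epoch count, and the initialization are assumptions) is:

% Sketch of an M-file for MA-model identification with the LMS rule
N   = 20;                  % number of training samples
b   = [1 2 3];             % assumed MA coefficients b0, b1, b2 (order M = 2)
x   = randn(1, N);         % Gaussian input with mean = 0 and var = 1
y   = filter(b, 1, x);     % y(n) = b0*x(n) + b1*x(n-1) + b2*x(n-2)
eta = 0.01;                % learning rate
w   = zeros(1, 3);         % initial weight estimates (use rand(1, 3) for a random start)
for epoch = 1:50
    for n = 3:N
        xv  = [x(n) x(n-1) x(n-2)];   % delayed-input vector
        err = y(n) - w * xv';         % output error
        w   = w + eta * err * xv;     % LMS update
    end
end
disp(w)                    % should approach the true coefficients b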
MA Modeling
• Zero initial weights and η = 0.01
• Number of data in the training set: N = 20
50
MA Modeling
• Zero initial weights and η = 0.1
• Number of data in the training set: N = 20
51
MA Modeling
• Random initial weights and η = 0.01
• Number of data in the training set: N = 20
52
MA Modeling
• Random initial weights and η = 0.1
• Number of data in the training set: N = 20
53
MA Modeling
• Random initial weights and η = 0.1
• Number of data in the training set: N = 10
54
MATLAB TOOLBOX
• net = newlin(PR,S,ID,LR)
• Description of function
– Linear layers are often used as adaptive filters for signal processing and prediction.
• NEWLIN(PR,S,ID,LR) takes these arguments:
• PR - R-by-Q matrix of Q representative input vectors.
• S - Number of elements in the output vector.
• ID - Input delay vector, default = [0].
• LR - Learning rate, default = 0.01;
and returns a new linear layer.
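A minimal usage sketch (this assumes the classic toolbox where NEWLIN, TRAIN, and SIM are available; the data and parameters are illustrative):

% Fit a linear layer to noisy linear data with the classic toolbox functions
P = 0:9;                              % 1-by-10 input vectors
T = P + 0.5 + 0.1 * randn(1, 10);     % targets: a noisy linear function of the input
net = newlin(P, 1, [0], 0.01);        % one output, no input delay, learning rate 0.01
net = train(net, P, T);               % train the weights and bias
Y = sim(net, P);                      % output of the trained linear layer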
55