
MACHINE LEARNING 09/10
Neural Networks – The ADALINE
Alexandre Bernardino, alex@isr.ist.utl.pt
Machine Learning, 2009/2010

Last Lecture Summary

• Introduction to Neural Networks
• Biological Neurons
• Artificial Neurons
• McCulloch and Pitts TLU
• Rosenblatt's Perceptron

Perceptron Limitations

• The perceptron's learning rule is not guaranteed to converge if the data is not linearly separable.
• Widrow-Hoff (1960): minimize the error at the output of the linear unit (e) rather than at the output of the threshold unit (e').

ADALINE – Adaptive Linear Element

• The separating hyperplane is equivalent to the perceptron's:

    w_0 + w_1 x_1 + \dots + w_N x_N = 0

ADALINE – Adaptive Linear Element

• The learning rule is different from the perceptron's.
• Given the training set:

    \{ (\vec{x}^p, d^p) \}, \quad p = 1, \dots, P

• minimize the cost function:

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 = \frac{1}{P} \sum_{p=1}^{P} (s^p - d^p)^2

ADALINE – Simplification

• Let us consider that, for every pattern, x_0^p = 1.
• Thus we can write:

    s^p = \sum_{l=0}^{N} w_l x_l^p
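As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this linear output, with the bias folded in by prepending x_0 = 1 to each pattern; the variable names and example values are hypothetical:

    import numpy as np

    def adaline_output(w, x):
        """Linear output s = sum_l w_l * x_l, with x already containing x_0 = 1."""
        return w @ x

    # Example with N = 2 inputs, so w has N + 1 = 3 components (w_0 is the bias).
    w = np.array([0.5, -1.0, 2.0])
    x_raw = np.array([3.0, 1.5])            # one training pattern, without the bias term
    x = np.concatenate(([1.0], x_raw))      # prepend x_0 = 1
    print(adaline_output(w, x))             # 0.5*1 - 1.0*3.0 + 2.0*1.5 = 0.5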

ADALINE – Analytic Solution

• Optimize the cost function:

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2      (1)

• Given that:

    s^p = \sum_{l=0}^{N} w_l x_l^p      (2)

    e^p = s^p - d^p      (3)

    \vec{w} = [w_0\ w_1\ \dots\ w_N]^T

• impose that all partial derivatives vanish:

    \frac{\partial E}{\partial w_k} = 0, \quad \forall k = 0, \dots, N      (4)

ADALINE – Analytic Solution

• Compute the gradient of the cost function:

    \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial e^p}{\partial w_k}

    \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial (s^p - d^p)}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k}

ADALINE – Analytic Solution

• Since s^p = \sum_{l=0}^{N} w_l x_l^p, we have \frac{\partial s^p}{\partial w_k} = x_k^p, and therefore:

    \frac{\partial E}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p \frac{\partial s^p}{\partial w_k} = \frac{1}{P} \sum_{p=1}^{P} 2 e^p x_k^p

ADALINE – Analytic Solution

• Very important!
• The partial derivative of the error function with respect to a weight is proportional to the sum, over all patterns, of the input on that weight multiplied by the error:

    \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p
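Stacking the patterns as rows of a matrix X (first column all ones) lets this gradient be computed in one line. The following NumPy sketch is illustrative, not the lecture's own code:

    import numpy as np

    def adaline_gradient(X, d, w):
        """Gradient of E(w) = (1/P) sum_p (s^p - d^p)^2 with respect to w.

        X : (P, N+1) matrix of patterns, first column equal to 1 (x_0^p = 1)
        d : (P,) vector of desired outputs
        w : (N+1,) weight vector
        """
        P = X.shape[0]
        e = X @ w - d                    # errors e^p = s^p - d^p
        return (2.0 / P) * (X.T @ e)     # component k equals (2/P) * sum_p x_k^p e^p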

ADALINE – Analytic Solution

• Given that e^p = s^p - d^p:

    \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} e^p x_k^p

    \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} (s^p - d^p) x_k^p

    \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} s^p x_k^p - \frac{2}{P} \sum_{p=1}^{P} d^p x_k^p

ADALINE – Analytic Solution

• Substituting s^p = \sum_{l=0}^{N} w_l x_l^p:

    \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p - \frac{2}{P} \sum_{p=1}^{P} d^p x_k^p

ADALINE – Analytic Solution

• Impose \frac{\partial E}{\partial w_k} = 0, \forall k = 0, \dots, N:

    \frac{2}{P} \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p - \frac{2}{P} \sum_{p=1}^{P} d^p x_k^p = 0

    \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p

ADALINE – Analytic Solution

    \sum_{p=1}^{P} \sum_{l=0}^{N} w_l x_l^p x_k^p = \sum_{p=1}^{P} d^p x_k^p, \quad \forall k = 0, \dots, N

• It is a linear system of N+1 equations with N+1 unknowns. How to solve it?

ADALINE – Matrix Notation

    \vec{w} = [w_0\ w_1\ \dots\ w_N]^T

    \vec{x}^p = [x_0^p\ x_1^p\ \dots\ x_N^p]^T, \quad x_0^p = 1

    s^p = \sum_{l=0}^{N} w_l x_l^p = \vec{w}^T \vec{x}^p

    e^p = \vec{w}^T \vec{x}^p - d^p

ADALINE – Matrix Notation

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2 = \frac{1}{P} \sum_{p=1}^{P} (\vec{w}^T \vec{x}^p - d^p)^2

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} (\vec{w}^T \vec{x}^p - d^p)(\vec{x}^{pT} \vec{w} - d^p)

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \vec{w}^T \vec{x}^p d^p - d^p \vec{x}^{pT} \vec{w} + (d^p)^2 \right)

ADALINE – Matrix Notation

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} \left( \vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \vec{w}^T \vec{x}^p d^p - d^p \vec{x}^{pT} \vec{w} + (d^p)^2 \right)

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} \vec{w}^T \vec{x}^p \vec{x}^{pT} \vec{w} - \frac{2}{P} \sum_{p=1}^{P} d^p \vec{x}^{pT} \vec{w} + \frac{1}{P} \sum_{p=1}^{P} (d^p)^2

    E(\vec{w}) = \vec{w}^T \left( \frac{1}{P} \sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT} \right) \vec{w} - 2 \left( \frac{1}{P} \sum_{p=1}^{P} d^p \vec{x}^{pT} \right) \vec{w} + \frac{1}{P} \sum_{p=1}^{P} (d^p)^2

ADALINE – Matrix Notation

• Let us introduce the average operator ⟨·⟩:

    \langle \cdot \rangle = \frac{1}{P} \sum_{p=1}^{P} (\cdot)

• The cost function is written as:

    E(\vec{w}) = \vec{w}^T \langle \vec{x}^p \vec{x}^{pT} \rangle \vec{w} - 2 \langle d^p \vec{x}^{pT} \rangle \vec{w} + \langle (d^p)^2 \rangle

ADALINE – Matrix Notation

• Defining:

    R_{xx} = \langle \vec{x}^p \vec{x}^{pT} \rangle = \frac{1}{P} \sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT}

    \vec{p} = \langle d^p \vec{x}^p \rangle = \frac{1}{P} \sum_{p=1}^{P} d^p \vec{x}^p

    \sigma_d^2 = \langle (d^p)^2 \rangle = \frac{1}{P} \sum_{p=1}^{P} (d^p)^2

• the cost function is:

    E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2 \vec{p}^T \vec{w} + \sigma_d^2

ADALINE – Quadratic Cost

    E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2 \vec{p}^T \vec{w} + \sigma_d^2

• R_{xx} is a covariance matrix – positive semi-definite.
• The error function surface is a paraboloid (a quadratic bowl).

ADALINE – Gradient Vector

    E(\vec{w}) = \vec{w}^T R_{xx} \vec{w} - 2 \vec{p}^T \vec{w} + \sigma_d^2

    \nabla E(\vec{w}) = 2 R_{xx} \vec{w} - 2 \vec{p}

ADALINE – Closed Form Solution

    \nabla E(\vec{w}^*) = 0 \iff R_{xx} \vec{w}^* = \vec{p}

• If R_{xx} is positive definite, the minimum is unique:

    \vec{w}^* = R_{xx}^{-1} \vec{p}
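A possible NumPy sketch of this closed-form solution (illustrative, with synthetic data and hypothetical names): it estimates R_xx and p as averages over the training set and solves R_xx w* = p rather than forming the inverse explicitly.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic training set: P patterns, N raw inputs, plus the constant x_0 = 1.
    P, N = 200, 3
    X = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])   # shape (P, N+1)
    w_true = np.array([0.5, -1.0, 2.0, 0.3])
    d = X @ w_true + 0.1 * rng.normal(size=P)                   # noisy desired outputs

    # R_xx = <x^p x^pT> and p = <d^p x^p> as averages over the training set.
    Rxx = (X.T @ X) / P
    p_vec = (X.T @ d) / P

    # Closed-form minimiser: solve R_xx w* = p (better conditioned than inverting R_xx).
    w_star = np.linalg.solve(Rxx, p_vec)
    print(w_star)                                               # close to w_true

    # Sanity check: the gradient 2 R_xx w* - 2 p is (numerically) zero at the minimum.
    print(np.allclose(2 * Rxx @ w_star - 2 * p_vec, 0.0, atol=1e-10))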

ADALINE – Closed Form Solution

• The closed-form solution requires the inversion of the covariance matrix, which can be problematic in high dimensions.
• Gradient methods are simpler and have proven convergence properties for quadratic functions:

    w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k}

ADALINE – Gradient Based Solution

• Remember, from a few slides back:

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} (e^p)^2

    \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p

ADALINE – Gradient Based Solution

    \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p

    w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial E}{\partial w_k}

    w_k^{(t+1)} = w_k^{(t)} - \eta \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p

ADALINE – Batch Algorithm

• Initialize the weights at arbitrary values.
• Define a learning rate η.
• Repeat:
  • For each pattern in the training set:
    • Apply x^p to the ADALINE input.
    • Observe the output s^p and compute the error e^p = s^p - d^p.
    • For each weight k, accumulate the product x_k^p e^p.
  • After processing all patterns, update each weight k by:

    w_k^{(t+1)} = w_k^{(t)} - \eta \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p
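A minimal NumPy sketch of this batch procedure, assuming the patterns are stacked as rows of X with x_0^p = 1 in the first column (names and default values are hypothetical):

    import numpy as np

    def adaline_batch(X, d, eta=0.05, epochs=100, rng=None):
        """Batch ADALINE training: one weight update per pass over the training set.

        X : (P, N+1) patterns with x_0^p = 1 in the first column
        d : (P,) desired outputs
        """
        if rng is None:
            rng = np.random.default_rng(0)
        P, M = X.shape
        w = rng.normal(scale=0.1, size=M)     # arbitrary initial weights
        for _ in range(epochs):
            e = X @ w - d                     # errors e^p = s^p - d^p for all patterns
            grad = (2.0 / P) * (X.T @ e)      # accumulated x_k^p * e^p, averaged over P
            w = w - eta * grad                # w_k <- w_k - eta * (2/P) * sum_p x_k^p e^p
        return w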

ADALINE – Batch Algorithm

ADALINE's batch algorithm properties:

• Guaranteed to converge to the weight set with minimum squared error:
  • given a sufficiently small learning rate η,
  • even when the training data contains noise,
  • even when the training data is not separable.

ADALINE – Batch Algorithm

• ADALINE's batch algorithm requires the availability of all the training data from the beginning.
• The weights are updated only after presenting the whole training set.
• But humans learn continuously!
• In some applications we may want to update the weights immediately after each training pattern is available.

ADALINE – Incremental Algorithm

• Incremental algorithm – approximate the complete gradient by its estimate for each pattern.

    Complete (exact) gradient:         \frac{\partial E}{\partial w_k} = \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p

    Stochastic (approximate) gradient: \widehat{\frac{\partial E}{\partial w_k}} = x_k^p e^p

ADALINE – Incremental Algorithm

• Incremental mode gradient descent:

    w_k^{(t+1)} = w_k^{(t)} - 2 \eta\, x_k^p e^p

• Batch mode gradient descent:

    w_k^{(t+1)} = w_k^{(t)} - \eta \frac{2}{P} \sum_{p=1}^{P} x_k^p e^p

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
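A matching sketch of the incremental version (again illustrative rather than the lecture's own code): the weights are updated immediately after every pattern, using the single-pattern gradient estimate x_k^p e^p.

    import numpy as np

    def adaline_incremental(X, d, eta=0.01, epochs=100, rng=None):
        """Incremental (stochastic) ADALINE training.

        X : (P, N+1) patterns with x_0^p = 1 in the first column
        d : (P,) desired outputs
        """
        if rng is None:
            rng = np.random.default_rng(0)
        P, M = X.shape
        w = rng.normal(scale=0.1, size=M)         # arbitrary initial weights
        for _ in range(epochs):
            for p in rng.permutation(P):          # random pattern order
                e_p = X[p] @ w - d[p]             # single-pattern error e^p = s^p - d^p
                w = w - 2.0 * eta * e_p * X[p]    # w <- w - 2*eta*x^p*e^p
        return w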

ADALINE – Incremental Algorithms

• Incremental gradient descent is also known as stochastic gradient descent.
• It is also called the LMS algorithm or the Delta Rule.
• It is based on an approximation of the gradient, so it never goes exactly to the minimum of the cost function.
• After reaching a vicinity of the minimum, it oscillates around it.
• The amplitude of the oscillations can be reduced by reducing η.

ADALINE – Comparison

• The plots show the value of one weight along time.

    [Figure: weight trajectory during training – Batch (horizontal axis: Epochs) vs. Incremental (horizontal axis: Patterns)]

1 Epoch = P patterns (the full training set)

ADALINE vs Perceptron

• Both the ADALINE Delta Rule and the Perceptron weight update rule are instances of Error Correction Learning.

    ADALINE Delta Rule:      w_k^{(t+1)} = w_k^{(t)} - 2 \eta\, x_k^p (s^p - d^p)

    Perceptron update rule:  w_k^{(t+1)} = w_k^{(t)} - \eta\, x_k^p (y^p - d^p)

• The ADALINE allows arbitrary real values at the output, whereas the perceptron assumes binary outputs.
• The ADALINE always converges (given a small enough η) to the minimum squared error, while the perceptron only converges when the data is separable.
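To make the difference concrete, a small illustrative sketch of the two per-pattern updates side by side; the 0/1 output convention for the perceptron is an assumption, not stated in the slides.

    import numpy as np

    def delta_rule_step(w, x, d, eta):
        """ADALINE update: error is measured at the linear output s = w.x."""
        s = w @ x
        return w - 2.0 * eta * x * (s - d)

    def perceptron_step(w, x, d, eta):
        """Perceptron update: error is measured at the thresholded output y."""
        y = 1.0 if w @ x >= 0 else 0.0    # 0/1 output convention (an assumption here)
        return w - eta * x * (y - d)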

ADALINE – Statistical Interpretation

• The analytical solution for the weights was obtained by "averaging" quantities computed from the training set.
• It is possible to make a statistical interpretation of the process:
  • Inputs: observations of N random variables
    X = [1, X_1, ..., X_k, ..., X_N]
  • Desired output: observations of 1 random variable
    D
  • Output: observations of 1 random variable
    Y = w^T X

ADALINE – Statistical Interpretation

• The error function can be interpreted as an approximation to the statistical expectation E[·]:

    E(\vec{w}) = \frac{1}{P} \sum_{p=1}^{P} (y^p - d^p)^2 \approx \mathrm{E}\left[ (Y - D)^2 \right]

• Solution:

    \vec{w}^* = R_{xx}^{-1} \vec{p}

• Matrix R_{xx} and vector \vec{p} can be interpreted as approximations to the statistical auto-covariance and cross-covariance between random variables:

    R_{xx} = \frac{1}{P} \sum_{p=1}^{P} \vec{x}^p \vec{x}^{pT} \approx \mathrm{E}[X X^T]

    \vec{p} = \frac{1}{P} \sum_{p=1}^{P} d^p \vec{x}^p \approx \mathrm{E}[D X]

ADALINE – Statistical Interpretation

• The LMS algorithm is based on an instantaneous estimate of the gradient.
• This estimate can be modeled by:

    \hat{g}(n) = g(n) + e_g(n)

  where e_g(n) is a random noise vector.
• LMS = stochastic gradient descent.

ADALINE – Statistical Interpretation

• Under reasonable conditions, stochastic gradient methods may converge to the exact solution.
• Convergence conditions [Monro and Ljung]:
  • e_g(n) is zero mean.
  • The pattern sequence is random.
  • η(n) tends slowly toward zero:

    \sum_{n=0}^{\infty} \eta^2(n) < \infty, \qquad \sum_{n=0}^{\infty} \eta(n) = \infty

ADALINE – Statistical Interpretation

• Typical learning rate schedules:

    \eta(n) = \frac{c}{n}

    \eta(n) = \frac{\eta_0}{1 + n / \tau}

    [Figure: the two learning rate schedules plotted over n = 0 to 1000, decaying toward zero]
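The two schedules are easy to write down directly; a tiny illustrative sketch with arbitrary constants (the values of c, η_0, and τ are hypothetical choices):

    import numpy as np

    def eta_c_over_n(n, c=1.0):
        """eta(n) = c / n, defined for n >= 1."""
        return c / n

    def eta_decay(n, eta0=0.3, tau=100.0):
        """eta(n) = eta0 / (1 + n / tau)."""
        return eta0 / (1.0 + n / tau)

    # Both schedules decay like 1/n for large n, so sum(eta(n)) diverges
    # while sum(eta(n)^2) converges, as required by the conditions above.
    n = np.arange(1, 1001)
    print(eta_c_over_n(n)[:3])   # [1.0, 0.5, 0.333...]
    print(eta_decay(n)[:3])      # slowly decaying from about 0.3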