Lecture 11: Neural Networks


Page 1: Lecture11 - neural networks

Introduction to Machine Learning

Lecture 11: Neural Networks

Albert Orriols i Puig – [email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull

Page 2: Lecture11 - neural networks

Recap of Lectures 5-10: Data classification

Decision trees (C4.5)

Instance-based learners (kNN and CBR)


Page 3: Lecture11 - neural networks

Recap of Lectures 5-10: Data classification

Probabilistic-based learners

$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$

Linear/polynomial classifier


Page 4: Lecture11 - neural networks

Today’s Agenda

Why Neural Networks?

Looking into a Brain

Neural Networks

Starting from the Beginning: Perceptrons

Multi-layer Perceptrons


Page 5: Lecture11 - neural networks

Why Neural Networks?

Brain vs. machines:

Machines are tremendously faster than brains in well-defined problems:

Invert matrices, solve differential equations, etc.

Brains are tremendously faster and more accurate than machines on ill-defined problems or problems that require a lot of processing:

Recognizing characters or objects on TV

Let's simulate our brains with artificial neural networks!

Massive parallelism

Neurons interchanging signals


Page 6: Lecture11 - neural networks

Looking into a Brain

10^11 neurons of more than 20 different types

0.001 seconds of neuron switching time

10^4 to 10^5 connections per neuron

0.1 seconds of scene recognition time


Page 7: Lecture11 - neural networks

Artificial Neural Networks

Borrow some ideas from the nervous systems of animals:

$a_i = g(in_i) = g\left(\sum_j W_{j,i}\, a_j\right)$


THE PERCEPTRON (McCulloch & Pitts)
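A minimal sketch of this unit equation in Python (the function name and the choice of a sigmoid for g are my own illustration, not from the slides):

```python
import numpy as np

def unit_activation(a, W_i, g=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """a_i = g(in_i) = g(sum_j W_ji * a_j): weighted sum, then activation g."""
    in_i = np.dot(W_i, a)  # linear input in_i of unit i
    return g(in_i)

print(unit_activation(np.array([1.0, 0.5]), np.array([0.2, -0.4])))  # prints 0.5
```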

Page 8: Lecture11 - neural networks

Adaline: Adaptive Linear Element

Adaptive linear combiner cascaded with a hard-limiting quantizer

Linear output transformed to binary by means of a threshold device

Training = adjusting the weights

Activation functions


Page 9: Lecture11 - neural networks

Adaline

Note that Adaline implements the function

$f(\vec{x}, \vec{w}) = w_0 + \sum_{i=1}^{n} x_i w_i$

This defines a threshold when the output is zero:

$f(\vec{x}, \vec{w}) = w_0 + \sum_{i=1}^{n} x_i w_i = 0$


Page 10: Lecture11 - neural networks

Adaline

Let's assume that we have two variables:

$f(\vec{x}, \vec{w}) = w_0 + x_1 w_1 + x_2 w_2$

Therefore,

$f(\vec{x}, \vec{w}) = w_0 + x_1 w_1 + x_2 w_2 = 0 \;\Rightarrow\; x_2 = -\frac{w_1}{w_2}\, x_1 - \frac{w_0}{w_2}$

So, Adaline is drawing a linear discriminant that divides the space into two regions: it is a linear classifier.
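A small runnable sketch of this two-variable Adaline in Python (the weight values are made up for illustration):

```python
def adaline_classify(x1, x2, w0, w1, w2):
    # Linear combiner followed by a hard-limiting quantizer
    s = w0 + x1 * w1 + x2 * w2
    return 1 if s >= 0 else -1

# Made-up weights: the decision boundary is the line x2 = -x1 + 1
w0, w1, w2 = -1.0, 1.0, 1.0
print(adaline_classify(2.0, 2.0, w0, w1, w2))  # +1 (above the line)
print(adaline_classify(0.0, 0.0, w0, w1, w2))  # -1 (below the line)
```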


Page 11: Lecture11 - neural networks

Adaline

So, we have a cool way to create linear classifiers.

But are linear classifiers enough to tackle our problems?

Can you draw a line that separates the examples of class white and class black in the last example?


Page 12: Lecture11 - neural networks

Moving to More Flexible NNs

So, we want to classify problems such as XOR. Any idea?

Polynomial discriminant functions

In this system:

$f(\vec{x}, \vec{w}) = w_0 + x_1 w_1 + x_1^2 w_{11} + x_1 x_2 w_{12} + x_2^2 w_{22} + x_2 w_2 = 0$

Page 13: Lecture11 - neural networks

Moving to More Flexible NNs

With appropriate values of w, we can fit data that is not linearly separable.

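A quick sketch (my own illustration, not from the slides) of how a single cross-term x1*x2 makes XOR separable by such a polynomial discriminant:

```python
# XOR with inputs encoded as {-1, +1}: no line w0 + x1*w1 + x2*w2 = 0 separates it,
# but the cross-term x1*x2 alone does (the weights below are made up).
def poly_discriminant(x1, x2, w0=0.0, w1=0.0, w2=0.0, w12=-1.0):
    s = w0 + x1 * w1 + x2 * w2 + x1 * x2 * w12
    return 1 if s >= 0 else -1

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), '->', poly_discriminant(x1, x2))
# (-1,-1) -> -1, (-1,+1) -> +1, (+1,-1) -> +1, (+1,+1) -> -1
```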

Page 14: Lecture11 - neural networks

Even more Flexible: Multi-layer NN

So, we want to classify problems such as XOR. Any other idea?

Madaline: Multiple Adalines connected

This also enables the network to solve non-separable problems


Page 15: Lecture11 - neural networks

But Step Down… How Do I Learn w?

We have seen that different structures enable us to define different functions.

But the key is to get a proper estimation of w

There are many algorithms:

Perceptron rule

α-LMS

α-perceptron

May’s algorithm

Backpropagation

We are going to see two examples: α-LMS and backprop.


Page 16: Lecture11 - neural networks

Weight Learning in Adaline

Recall that we want to adjust w.


Page 17: Lecture11 - neural networks

Weight Learning in Adaline

Weight learning with the α-LMS algorithm:

Incrementally update the weights as

$W_{k+1} = W_k + \alpha\, \frac{\varepsilon_k X_k}{|X_k|^2}$

The error is the difference between the expected and the actual output:

$\varepsilon_k = d_k - W_k^T X_k$

A change in the weights affects the error:

$\Delta\varepsilon_k = \Delta(d_k - W_k^T X_k) = -X_k^T\, \Delta W_k$

And the weight change is

$\Delta W_k = W_{k+1} - W_k = \alpha\, \frac{\varepsilon_k X_k}{|X_k|^2}$

Therefore,

$\Delta\varepsilon_k = -X_k^T\, \alpha\, \frac{\varepsilon_k X_k}{|X_k|^2} = -\alpha\, \varepsilon_k$

Page 18: Lecture11 - neural networks

Weight Learning in Adaline

$\Delta W_k = \alpha\, \frac{\varepsilon_k X_k}{|X_k|^2} \qquad \Delta\varepsilon_k = -X_k^T\, \Delta W_k$
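A runnable sketch of the α-LMS update above (the toy data, learning rate, and training loop are my own illustration):

```python
import numpy as np

def alpha_lms_train(X, d, alpha=0.1, epochs=50):
    """Adaline training with alpha-LMS: W <- W + alpha * eps_k * X_k / |X_k|^2."""
    X = np.hstack([np.ones((len(X), 1)), X])  # prepend x0 = 1 so w0 acts as the bias
    W = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_k, d_k in zip(X, d):
            eps_k = d_k - W @ x_k             # error of the linear combiner
            W += alpha * eps_k * x_k / (x_k @ x_k)
    return W

# Made-up linearly separable data: desired output +1 when x1 + x2 > 1, else -1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2]], dtype=float)
d = np.array([-1, -1, -1, 1, 1], dtype=float)
W = alpha_lms_train(X, d)
print(np.sign(np.hstack([np.ones((len(X), 1)), X]) @ W))  # should recover d
```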

Page 19: Lecture11 - neural networks

Backpropagation

α-LMS works for networks with a single layer. But what happens in networks with multiple layers?

Backpropagation (Rumelhart, 1986): the most influential development in NNs in the 1980s.

Here, we present the method conceptually (the math details are in the papers).

Let's assume a network with:

Three neurons in the input layer

Two neurons in the output layer


Page 20: Lecture11 - neural networks

Backpropagation

Strategy:

Compute the gradient of the error:

$\hat{\nabla}_k = \frac{\partial \varepsilon_k^2}{\partial W_k}$

Adjust the weights in the direction opposite to the instantaneous error gradient.

Now, $W_k$ is a vector that contains all the weights of the net.


Page 21: Lecture11 - neural networks

Backpropagation

Algorithm:

1. Insert a new example $X_k$ into the network and sweep it forward until getting the output $y$.

2. Compute the square error of this example:

$\varepsilon_k^2 = \sum_{i=1}^{N_y} \varepsilon_{ik}^2 = \sum_{i=1}^{N_y} (d_{ik} - y_{ik})^2$

For example, for two outputs (disregarding $k$):

$\varepsilon^2 = (d_1 - y_1)^2 + (d_2 - y_2)^2$

3. Propagate the error to the previous layer (back-propagation). How?

Steepest descent: compute the derivative of the square error, $\delta$, for each Adaline.

Artificial Intelligence Machine Learning

Page 22: Lecture11 - neural networks

Backpropagation Example

Example borrowed from: http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html


Page 23: Lecture11 - neural networks

Backpropagation Example

1. Sweep the weights forward.


Page 24: Lecture11 - neural networks

Backpropagation Example

2. Backpropagate the error.


Page 25: Lecture11 - neural networks

Backpropagation Example

3. Modify the weights of each neuron.


Page 26: Lecture11 - neural networks

Backpropagation Example

3.bis. Do the same for each neuron.


Page 27: Lecture11 - neural networks

Backpropagation Example

3.bis2. Continue until reaching the output.


Page 28: Lecture11 - neural networks

Backpropagation for a Two-Layer Net.

That is, the algorithm is:

1. Find the instantaneous square error derivative:

$\delta_j^{(l)} = -\frac{1}{2}\, \frac{\partial \varepsilon^2}{\partial s_j^{(l)}}$

This tells us how sensitive the square output error of the network is to changes in the linear output $s$ of the associated Adaline.

2. Expanding the error term, we get

$\delta_1^{(2)} = -\frac{1}{2}\, \frac{\partial \left[ (d_1 - y_1)^2 + (d_2 - y_2)^2 \right]}{\partial s_1^{(2)}} = -\frac{1}{2}\, \frac{\partial \left( d_1 - \mathrm{sgm}(s_1^{(2)}) \right)^2}{\partial s_1^{(2)}}$

3. And recognizing that $d_1$ is independent of $s_1^{(2)}$:

$\delta_1^{(2)} = \left( d_1 - \mathrm{sgm}(s_1^{(2)}) \right) \mathrm{sgm}'(s_1^{(2)}) = \varepsilon_1^{(2)}\, \mathrm{sgm}'(s_1^{(2)})$

Page 29: Lecture11 - neural networks

Backpropagation for a Two-Layer Net.

That is, the algorithm is (continued):

4. Similarly, for the hidden layer we have

$\delta_1^{(1)} = -\frac{1}{2}\, \frac{\partial \varepsilon^2}{\partial s_1^{(1)}} = -\frac{1}{2} \left( \frac{\partial \varepsilon^2}{\partial s_1^{(2)}}\, \frac{\partial s_1^{(2)}}{\partial s_1^{(1)}} + \frac{\partial \varepsilon^2}{\partial s_2^{(2)}}\, \frac{\partial s_2^{(2)}}{\partial s_1^{(1)}} \right)$

5. That is,

$\delta_1^{(1)} = \delta_1^{(2)}\, \frac{\partial s_1^{(2)}}{\partial s_1^{(1)}} + \delta_2^{(2)}\, \frac{\partial s_2^{(2)}}{\partial s_1^{(1)}}$

which yields

$\delta_1^{(1)} = \delta_1^{(2)}\, \frac{\partial \left[ w_{10}^{(2)} + \sum_{i=1}^{3} w_{i1}^{(2)}\, \mathrm{sgm}(s_i^{(1)}) \right]}{\partial s_1^{(1)}} + \delta_2^{(2)}\, \frac{\partial \left[ w_{20}^{(2)} + \sum_{i=1}^{3} w_{i2}^{(2)}\, \mathrm{sgm}(s_i^{(1)}) \right]}{\partial s_1^{(1)}}$

$\delta_1^{(1)} = \left[ \delta_1^{(2)}\, w_{11}^{(2)} + \delta_2^{(2)}\, w_{21}^{(2)} \right] \mathrm{sgm}'(s_1^{(1)})$

Page 30: Lecture11 - neural networks

Backpropagation for a Two-Layer Net.

Defining

$\varepsilon_1^{(1)} \triangleq \delta_1^{(2)}\, w_{11}^{(2)} + \delta_2^{(2)}\, w_{21}^{(2)}$

we obtain

$\delta_1^{(1)} = \varepsilon_1^{(1)}\, \mathrm{sgm}'(s_1^{(1)})$

Implementation details of each Adaline
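Putting the whole derivation together, here is a compact runnable sketch of one backpropagation step for a two-layer net with 3 hidden Adalines and 2 outputs (the input size, initial weights, and learning rate are my own assumptions):

```python
import numpy as np

def sgm(s):
    # Sigmoid activation; note sgm'(s) = sgm(s) * (1 - sgm(s))
    return 1.0 / (1.0 + np.exp(-s))

def backprop_step(x, d, W1, W2, alpha=0.5):
    """One steepest-descent update: forward sweep, then delta^(2) and delta^(1)."""
    # Forward sweep (bias folded in as a constant 1 input)
    a1 = sgm(W1 @ np.append(x, 1.0))          # hidden-layer outputs
    y = sgm(W2 @ np.append(a1, 1.0))          # network outputs
    # Output layer: delta^(2) = eps^(2) * sgm'(s^(2)), with eps^(2) = d - y
    delta2 = (d - y) * y * (1 - y)
    # Hidden layer: eps^(1)_i = sum_j delta_j^(2) * w_ij^(2), then times sgm'(s^(1))
    eps1 = W2[:, :-1].T @ delta2              # exclude the bias column
    delta1 = eps1 * a1 * (1 - a1)
    # Move the weights opposite to the instantaneous error gradient
    W2 += alpha * np.outer(delta2, np.append(a1, 1.0))
    W1 += alpha * np.outer(delta1, np.append(x, 1.0))
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 3))  # 3 hidden Adalines, 2 inputs + bias (assumed)
W2 = rng.normal(scale=0.5, size=(2, 4))  # 2 output Adalines, 3 hidden + bias
for _ in range(1000):
    W1, W2 = backprop_step(np.array([1.0, 0.0]), np.array([1.0, 0.0]), W1, W2)
```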

Page 31: Lecture11 - neural networks

Next Class

Support Vector Machines


Page 32: Lecture11 - neural networks

Introduction to Machine Learning

Lecture 11: Neural Networks

Albert Orriols i Puig – [email protected]

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle

Universitat Ramon Llull