Supervised learning. Multilayer neural networks


Page 1: Supervised learning. Multilayer neural networks

CS 2710 Foundations of AI, Lecture 23
Milos Hauskrecht, [email protected], 5329 Sennott Square

Supervised learning. Multilayer neural networks.

Announcements
• Homework 10: due on Thursday, December 2, 2004
• Final exam: December 14, 2004, 1:00pm-3:00pm
  – Location: Sennott Square 5129
  – Closed book
  – Cumulative

Page 2: Supervised learning. Multilayer neural networks

Linear units

Linear regression:
$$f(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j$$

Logistic regression:
$$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g\Big(w_0 + \sum_{j=1}^{d} w_j x_j\Big)$$

[Unit diagrams: inputs $1, x_1, x_2, \ldots, x_d$ weighted by $w_0, w_1, w_2, \ldots, w_d$; the logistic unit passes the weighted sum $z$ through the sigmoid $g$.]

On-line gradient update (the same for both models):
$$w_j \leftarrow w_j + \alpha\,(y - f(\mathbf{x}))\,x_j$$
$$w_0 \leftarrow w_0 + \alpha\,(y - f(\mathbf{x}))$$
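The update rule can be written directly in code. A minimal sketch in NumPy (function and variable names are my own, not from the slides); the same function covers both the linear and the logistic unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(w, x, y, alpha, logistic=False):
    """One on-line gradient step; the rule is identical for the
    linear (regression) and logistic (classification) unit."""
    xb = np.concatenate(([1.0], x))    # prepend the constant bias input 1
    z = w @ xb                         # weighted sum w_0 + sum_j w_j x_j
    f = sigmoid(z) if logistic else z  # unit output f(x)
    return w + alpha * (y - f) * xb    # w_j <- w_j + alpha (y - f(x)) x_j
```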

Limitations of basic linear units

Linear regression:
$$f(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j$$

Logistic regression:
$$f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g\Big(w_0 + \sum_{j=1}^{d} w_j x_j\Big)$$

In both cases the function is linear in the inputs!! Linear decision boundary!!

Page 3: Supervised learning. Multilayer neural networks

Extensions of simple linear units

• Use feature (basis) functions to model nonlinearities
• $\phi_j(\mathbf{x})$: an arbitrary function of $\mathbf{x}$

Linear regression:
$$f(\mathbf{x}) = w_0 + \sum_{j=1}^{m} w_j \phi_j(\mathbf{x})$$

Logistic regression:
$$f(\mathbf{x}) = g\Big(w_0 + \sum_{j=1}^{m} w_j \phi_j(\mathbf{x})\Big)$$

[Unit diagram: feature values $1, \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_m(\mathbf{x})$ weighted by $w_0, w_1, w_2, \ldots, w_m$.]
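As a brief illustrative sketch (the helper names and the quadratic feature set are my own choices; the slides only require the $\phi_j$ to be arbitrary functions of $\mathbf{x}$), the extended unit applies the same weighted sum to transformed inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extended_unit(w, x, phis, logistic=False):
    """Extended linear unit: f(x) = w_0 + sum_j w_j phi_j(x),
    optionally passed through the sigmoid g for logistic regression."""
    z = w[0] + sum(wj * phi(x) for wj, phi in zip(w[1:], phis))
    return sigmoid(z) if logistic else z

# Example: quadratic feature functions of a 2-d input
phis = [lambda x: x[0], lambda x: x[1],
        lambda x: x[0] ** 2, lambda x: x[1] ** 2, lambda x: x[0] * x[1]]
w = np.zeros(1 + len(phis))  # w_0 plus one weight per feature function
```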

Regression with the quadratic model.

• Limitation of the plain linear model: a linear hyper-plane only
• A non-linear surface can fit the data better

Page 4: Supervised learning. Multilayer neural networks

Classification with the linear model.

• The logistic regression model defines a linear decision boundary
• Example: 2 classes (blue and red points)

[Figure: 2-D data with two classes and the resulting linear decision boundary.]

Linear decision boundary
• the logistic regression model is not optimal here, but not that bad

[Figure: 2-D data where a line separates the two classes reasonably well.]

Page 5: Supervised learning. Multilayer neural networks

When does logistic regression fail?

• Example in which the logistic regression model fails

[Figure: 2-D data for which no linear decision boundary separates the classes well.]

Limitations of linear units.

• Logistic regression does not work for parity functions: no linear decision boundary exists

[Figure: 2-D parity (XOR-like) data.]

Solution: a model of a non-linear decision boundary

Page 6: Supervised learning. Multilayer neural networks

Example. Regression with polynomials.

Regression with polynomials of degree m:
• Data points: pairs of $\langle x, y \rangle$
• Feature functions: m feature functions
$$\phi_i(x) = x^i, \qquad i = 1, 2, \ldots, m$$
• Function to learn:
$$f(x, \mathbf{w}) = w_0 + \sum_{i=1}^{m} w_i \phi_i(x) = w_0 + \sum_{i=1}^{m} w_i x^i$$
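In code, the degree-m polynomial model is just the extended linear unit with $\phi_i(x) = x^i$. A small sketch (the function name is my own):

```python
def poly_f(x, w):
    """f(x, w) = w_0 + sum_{i=1..m} w_i * x**i, with m = len(w) - 1.
    The i = 0 term contributes w_0 since x**0 = 1."""
    return sum(wi * x ** i for i, wi in enumerate(w))
```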

Learning with extended linear units

Feature (basis) functions model nonlinearities:

Linear regression:
$$f(\mathbf{x}) = w_0 + \sum_{j=1}^{m} w_j \phi_j(\mathbf{x})$$

Logistic regression:
$$f(\mathbf{x}) = g\Big(w_0 + \sum_{j=1}^{m} w_j \phi_j(\mathbf{x})\Big)$$

Important property:
• The problem of learning the weights is the same as it was for the linear units
• Trick: we have changed the inputs, but the model is still linear in the new inputs (and in the weights)

Page 7: Supervised learning. Multilayer neural networks

Learning with feature functions.

Function to learn:
$$f(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{k} w_i \phi_i(\mathbf{x})$$

On-line gradient update for the $\langle \mathbf{x}, y \rangle$ pair:
$$w_j \leftarrow w_j + \alpha\,(y - f(\mathbf{x}, \mathbf{w}))\,\phi_j(\mathbf{x})$$
$$w_0 \leftarrow w_0 + \alpha\,(y - f(\mathbf{x}, \mathbf{w}))$$

Gradient updates are of the same form as in the linear and logistic regression models.

Example. Regression with polynomials of degree m.

$$f(x, \mathbf{w}) = w_0 + \sum_{i=1}^{m} w_i \phi_i(x) = w_0 + \sum_{i=1}^{m} w_i x^i$$

On-line update for the $\langle x, y \rangle$ pair:
$$w_j \leftarrow w_j + \alpha\,(y - f(x, \mathbf{w}))\,x^j$$
$$w_0 \leftarrow w_0 + \alpha\,(y - f(x, \mathbf{w}))$$
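Putting the model and the update together gives a complete on-line learner for the polynomial example. A sketch assuming NumPy; the defaults for alpha and the number of passes are my own:

```python
import numpy as np

def fit_poly_online(data, m, alpha=0.01, n_passes=100):
    """On-line gradient updates for degree-m polynomial regression.
    data: sequence of (x, y) pairs with scalar x and y."""
    w = np.zeros(m + 1)
    for _ in range(n_passes):
        for x, y in data:
            phi = np.array([x ** i for i in range(m + 1)])  # phi_0 = 1 is the bias
            err = y - w @ phi                               # y - f(x, w)
            w += alpha * err * phi                          # w_j <- w_j + alpha*err*x**j
    return w
```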

Page 8: Supervised learning. Multilayer neural networks

Multi-layered neural networks

• An alternative way to introduce nonlinearities to regression/classification models
• Idea: cascade several simple neural models with logistic units, much like neuron connections

Multilayer neural network

• Cascades multiple logistic regression units
• Also called a multilayer perceptron (MLP)
• Example: a (2-layer) classifier with non-linear decision boundaries

[Network diagram: input layer $1, x_1, x_2, \ldots, x_d$; hidden layer of logistic units with outputs $z_1^{(1)}, z_2^{(1)}$ and weights $w^{(1)}$; output-layer unit with input $z_1^{(2)}$ and weights $w^{(2)}_{1,0}, w^{(2)}_{1,1}, w^{(2)}_{1,2}$, computing $p(y = 1 \mid \mathbf{x})$.]
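A minimal sketch of the forward pass of this 2-layer network (the matrix layout and names are my own: W1 holds the first-layer weights $w^{(1)}$ and W2 the output weights $w^{(2)}$, each with the bias in position 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, W2):
    """2-layer MLP: hidden logistic units feeding one logistic output unit.
    W1: (n_hidden, d+1), W2: (n_hidden+1,). Returns p(y=1 | x, w)."""
    h = sigmoid(W1 @ np.concatenate(([1.0], x)))     # hidden outputs z(1)
    return sigmoid(W2 @ np.concatenate(([1.0], h)))  # output unit
```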

Page 9: Supervised learning. Multilayer neural networks

Multilayer neural network

• Models non-linearities through logistic regression units
• Can be applied to both regression and binary classification problems

[Network diagram: input layer → hidden layer → output layer, with the output unit as an option:
classification: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$, or
regression: $f(\mathbf{x}) = f(\mathbf{x}, \mathbf{w})$.]

Multilayer neural network

• Non-linearities are modeled using multiple hidden logistic regression units (organized in layers)
• The output layer determines whether it is a regression or a binary classification problem

[Network diagram: input layer → hidden layers → output layer, with the same output option:
classification: $f(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$, or
regression: $f(\mathbf{x}) = f(\mathbf{x}, \mathbf{w})$.]

Page 10: Supervised learning. Multilayer neural networks

Learning with MLP

• How to learn the parameters of the neural network? The online gradient descent algorithm:
  – weight updates are based on the online error $J_{\text{online}}(D_i, \mathbf{w})$:
$$w_j \leftarrow w_j - \alpha \frac{\partial}{\partial w_j} J_{\text{online}}(D_i, \mathbf{w})$$
• We need to compute gradients for the weights in all units
• They can be computed in one backward sweep through the net!!!
• The process is called back-propagation

Backpropagation

Notation:
$x_i(k)$ – output of unit $i$ on level $k$
$z_i(k)$ – input to the sigmoid function on level $k$
$w_{i,j}(k)$ – weight between units $j$ and $i$ on levels $(k-1)$ and $k$

$$z_i(k) = w_{i,0}(k) + \sum_j w_{i,j}(k)\, x_j(k-1)$$
$$x_i(k) = g(z_i(k))$$

[Diagram: units on the (k-1)-th, k-th and (k+1)-th levels, connected by weights $w_{i,j}(k)$ and $w_{l,i}(k+1)$.]

Page 11: Supervised learning. Multilayer neural networks

Backpropagation

Update weight $w_{i,j}(k)$ using a data point $D_u = \langle \mathbf{x}, y \rangle$:
$$w_{i,j}(k) \leftarrow w_{i,j}(k) - \alpha \frac{\partial}{\partial w_{i,j}(k)} J_{\text{online}}(D_u, \mathbf{w})$$

Let
$$\delta_i(k) = \frac{\partial}{\partial z_i(k)} J_{\text{online}}(D_u, \mathbf{w})$$

Then:
$$\frac{\partial}{\partial w_{i,j}(k)} J_{\text{online}}(D_u, \mathbf{w}) = \frac{\partial J_{\text{online}}(D_u, \mathbf{w})}{\partial z_i(k)} \frac{\partial z_i(k)}{\partial w_{i,j}(k)} = \delta_i(k)\, x_j(k-1)$$

such that $\delta_i(k)$ is computed from $x_i(k)$ and the next layer's $\delta_l(k+1)$:
$$\delta_i(k) = \Big(\sum_l \delta_l(k+1)\, w_{l,i}(k+1)\Big)\, x_i(k)\,\big(1 - x_i(k)\big)$$

Last unit (the same as for the regular linear units):
$$\delta_i(K) = -\big(y - f(\mathbf{x}, \mathbf{w})\big)$$

It is the same for classification with the log-likelihood measure of fit and for linear regression with the least-squares error!!!
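The delta recursion translates into a short backward sweep. A sketch assuming sigmoid units throughout and my own data layout (xs[k] is the vector of outputs $x(k)$ of level k, with xs[0] the inputs; W[k] is the weight matrix into level k+1, bias weights in column 0):

```python
import numpy as np

def backward_sweep(xs, W, y, f_out):
    """Compute delta_i(k) for every level in one backward pass.
    Returns a list of delta vectors, one per level 1..K (single output unit)."""
    K = len(W)
    deltas = [None] * K
    deltas[K - 1] = np.array([-(y - f_out)])  # last unit: delta(K) = -(y - f(x,w))
    for k in range(K - 2, -1, -1):
        # sum_l delta_l(k+1) w_{l,i}(k+1), skipping the bias column
        back = W[k + 1][:, 1:].T @ deltas[k + 1]
        deltas[k] = back * xs[k + 1] * (1 - xs[k + 1])  # times x_i(k)(1 - x_i(k))
    return deltas
```

The weight gradient for a level is then the outer product of that level's deltas with the previous level's outputs, exactly as in the formula $\delta_i(k)\, x_j(k-1)$ above.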

Learning with MLP

• Online gradient descent algorithm; weight update:
$$w_{i,j}(k) \leftarrow w_{i,j}(k) - \alpha \frac{\partial}{\partial w_{i,j}(k)} J_{\text{online}}(D_u, \mathbf{w})$$

$$\frac{\partial}{\partial w_{i,j}(k)} J_{\text{online}}(D_u, \mathbf{w}) = \frac{\partial J_{\text{online}}(D_u, \mathbf{w})}{\partial z_i(k)} \frac{\partial z_i(k)}{\partial w_{i,j}(k)} = \delta_i(k)\, x_j(k-1)$$

so
$$w_{i,j}(k) \leftarrow w_{i,j}(k) - \alpha\, \delta_i(k)\, x_j(k-1)$$

where
$x_j(k-1)$ – the j-th output of the (k-1)-th layer
$\delta_i(k)$ – derivative computed via backpropagation
$\alpha$ – a learning rate

Page 12: Supervised learning. Multilayer neural networks

Online gradient descent algorithm for MLP

Online-gradient-descent (D, number of iterations)
  initialize all weights $w_{i,j}(k)$
  for i = 1 : number of iterations do
    select a data point $D_u = \langle \mathbf{x}, y \rangle$ from D
    set the learning rate $\alpha$
    compute the outputs $x_j(k)$ for each unit
    compute the derivatives $\delta_i(k)$ via backpropagation
    update all weights (in parallel): $w_{i,j}(k) \leftarrow w_{i,j}(k) - \alpha\, \delta_i(k)\, x_j(k-1)$
  end for
  return weights w
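A minimal end-to-end sketch of this algorithm for a 2-hidden-unit network like the one in the XOR example on the next slides (the initialization, learning rate, and iteration count are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_gradient_descent(D, n_hidden=2, alpha=0.5, n_iters=20000, seed=0):
    """Online gradient descent for a 2-layer MLP with sigmoid units."""
    rng = np.random.default_rng(seed)
    d = len(D[0][0])
    W1 = rng.normal(0.0, 1.0, (n_hidden, d + 1))  # initialize all weights
    W2 = rng.normal(0.0, 1.0, n_hidden + 1)       # (bias weight in position 0)
    for _ in range(n_iters):
        x, y = D[rng.integers(len(D))]            # select a data point <x, y>
        xb = np.concatenate(([1.0], x))           # compute outputs for each unit
        h = sigmoid(W1 @ xb)
        hb = np.concatenate(([1.0], h))
        f = sigmoid(W2 @ hb)
        d2 = -(y - f)                             # derivatives via backpropagation
        d1 = d2 * W2[1:] * h * (1 - h)
        W2 -= alpha * d2 * hb                     # update all weights (in parallel):
        W1 -= alpha * np.outer(d1, xb)            # w <- w - alpha * delta * x(k-1)
    return W1, W2

# XOR data (the example on the next slides): no linear boundary exists,
# but two hidden units suffice
D = [(np.array([0.0, 0.0]), 0), (np.array([0.0, 1.0]), 1),
     (np.array([1.0, 0.0]), 1), (np.array([1.0, 1.0]), 0)]
W1, W2 = online_gradient_descent(D)
```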

Xor example.

• A linear decision boundary does not exist

[Figure: the XOR data in 2-D; the two classes cannot be separated by a line.]

Page 13: Supervised learning. Multilayer neural networks


Xor example. Linear unit


Xor example. Neural network with 2 hidden units

Page 14: Supervised learning. Multilayer neural networks


Xor example. Neural network with 10 hidden units

MLP in practice

• Optical character recognition: 20x20 pixel digits
  – automatic sorting of mail
  – a 5-layer network with multiple outputs: 10 outputs (0, 1, ..., 9), 20x20 = 400 inputs

  layer   neurons   weights
    5        10       3000
    4       300       1200
    3      1200      50000
    2       784       3136
    1      3136      78400