
Page 1: CSC446: Pattern Recognition (LN8)

Lecture Note 8:

Ch5: Linear Discriminant Functions

Ch6: Multilayer Neural Networks

CSC446 : Pattern Recognition

Prof. Dr. Mostafa Gadal-Haqq

Faculty of Computer & Information Sciences

Computer Science Department

AIN SHAMS UNIVERSITY

Page 2: CSC446: Pattern Recognition (LN8)

1. Introduction

2. Linear Classifier & Decision Surface

2-1. The Two-Category Case

2-2. The Multi-Category Case

3. Unconstrained Optimization methods:

3-1. The Gradient Descent method

3-2. Newton's Descent method

Linear Discriminant Functions

Page 3: CSC446: Pattern Recognition (LN8)

• In Bayesian decision theory:

– the underlying probability densities were known (or given); the training samples are used to estimate the parameters of these probability densities.

• In linear discriminant functions:

– we only know the proper form of the discriminant functions (linear classifiers); we use the training samples to learn the values of the parameters of the classifier.

– They may not be optimal, but they are very simple to implement.

Linear Discriminant Functions

Page 4: CSC446: Pattern Recognition (LN8)

• Definition:

– A linear discriminant function is a function that is a linear combination of the components of the feature vector x:

g(x) = wTx + w0 (1)

where w is the weight vector and w0 is the bias or threshold.

• For a two-category classifier the decision rule is:

Decide ω1 if g(x) > 0 and ω2 if g(x) < 0,

or: decide ω1 if wTx > –w0 and ω2 otherwise.

Linear Discriminant Functions
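As a minimal illustration of this rule (not from the slides; the weight values below are made-up assumptions), a NumPy sketch:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

def decide(x, w, w0):
    """Two-category rule: omega_1 if g(x) > 0, else omega_2."""
    return "omega_1" if g(x, w, w0) > 0 else "omega_2"

# Illustrative values (assumed, not from the lecture):
w, w0 = np.array([1.0, -2.0]), 0.5
print(decide(np.array([3.0, 1.0]), w, w0))  # g = 3 - 2 + 0.5 = 1.5 > 0 -> omega_1
```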

Page 5: CSC446: Pattern Recognition (LN8)

Linear Discriminant Functions

Page 6: CSC446: Pattern Recognition (LN8)

The Two-Category Case:

• The equation g(x) = 0 defines the decision surface that separates points of category ω1 from points of category ω2.

• When g(x) is linear, the decision surface is a hyperplane with the vector w normal to it.

• The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane.

Linear Discriminant Functions
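The algebraic distance can be made explicit; a standard derivation, written out here for completeness (the slide states the result, not the steps):

```latex
% Decompose x into its projection x_p onto the hyperplane plus a step r along the unit normal:
x = x_p + r \, \frac{w}{\|w\|}, \qquad g(x_p) = 0
% Substituting into g(x) = w^T x + w_0 gives
g(x) = w^T x_p + w_0 + r \, \frac{w^T w}{\|w\|} = r \, \|w\|
% so the signed distance from x to the hyperplane is
r = \frac{g(x)}{\|w\|}
```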

Page 7: CSC446: Pattern Recognition (LN8)

Linear Discriminant Functions

Page 8: CSC446: Pattern Recognition (LN8)

• In general, the hyperplane divides the feature space into two half-spaces: decision region R1 for ω1 and region R2 for ω2.

• It is sometimes said that any x in R1 is on the positive side of the hyperplane, and any x in R2 is on the negative side.

• The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w0.

Linear Discriminant Functions

Page 9: CSC446: Pattern Recognition (LN8)

The multi-category case:

– We define c linear discriminant functions:

gi(x) = wiTx + wi0,   i = 1, …, c

– The decision rule: assign x to ωi if gi(x) > gj(x) for all j ≠ i (a sketch follows below).

– In this case, the classifier is a "linear machine".

– In case of ties, the classification is undefined.

Linear Discriminant Functions
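A minimal sketch of such a linear machine in NumPy (the weight values are illustrative assumptions, not from the course):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class with the largest g_i(x) = W[i] @ x + w0[i]."""
    scores = W @ x + w0            # g_i(x) for i = 1..c
    return int(np.argmax(scores))  # ties broken arbitrarily here; formally undefined

# Three classes, two features (illustrative values):
W  = np.array([[ 1.0,  0.0],
               [-1.0,  1.0],
               [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))  # -> 0
```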

Page 10: CSC446: Pattern Recognition (LN8)

The multi-category case:

– A linear machine divides the feature space into c decision regions, with gi(x) being the largest discriminant if x is in region Ri.

– The boundary that separates two contiguous regions Ri and Rj is a portion of the hyperplane Hij defined by:

gi(x) = gj(x), i.e. (wi – wj)Tx + (wi0 – wj0) = 0

– Thus (wi – wj) is normal to Hij, and the distance from x to Hij is:

d(x, Hij) = (gi(x) – gj(x)) / ||wi – wj||

Linear Discriminant Functions

Page 11: CSC446: Pattern Recognition (LN8)

Linear Discriminant Functions

Page 12: CSC446: Pattern Recognition (LN8)

• Consider a cost function J(w) that is a continuously differentiable function of the unknown weight vector w. We need to find the optimal solution w* that satisfies:

J(w*) ≤ J(w)   for all w

• That is, solve the unconstrained optimization problem stated as: "Minimize the cost function J(w) with respect to the weight vector w".

• The necessary condition for optimality:

∇J(w*) = 0

Unconstrained Optimization Methods

Page 13: CSC446: Pattern Recognition (LN8)

• The Gradient Descent Learning method:

– Minimizing J(a) is done simply by starting with an arbitrary value for a, say a(1), and computing the gradient ∇J(a(1)). The next value, a(2), is obtained by moving a(1) in the direction of steepest descent, i.e. along the negative of the gradient. In general:

a(k+1) = a(k) – η(k) ∇J(a(k))

where η(k) is a positive learning rate (a sketch follows this slide).

Unconstrained Optimization Methods
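A minimal gradient-descent sketch under assumed conditions (a simple quadratic cost and a fixed learning rate; both are illustrative choices, not specified on the slide):

```python
import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, steps=100, tol=1e-8):
    """a(k+1) = a(k) - eta * grad_J(a(k)), stopping once the step is tiny."""
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        step = eta * grad_J(a)
        a -= step
        if np.linalg.norm(step) < tol:
            break
    return a

# Example cost: J(a) = (a1 - 3)^2 + 2*(a2 + 1)^2, so grad_J = [2(a1-3), 4(a2+1)]
grad_J = lambda a: np.array([2 * (a[0] - 3), 4 * (a[1] + 1)])
print(gradient_descent(grad_J, [0.0, 0.0]))  # approaches [3, -1]
```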

Page 14: CSC446: Pattern Recognition (LN8)

• The Newton's Descent Learning method:

– It uses the second-order expansion of J(a) around a(k):

J(a) ≈ J(a(k)) + ∇JT(a – a(k)) + (1/2)(a – a(k))T H (a – a(k))

which is minimized by the update:

a(k+1) = a(k) – H^-1 ∇J(a(k))

– H is the Hessian matrix of J(a) evaluated at a(k) (a sketch follows this slide).

Unconstrained Optimization Methods
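For comparison, a Newton's-descent sketch on the same assumed quadratic cost (the cost and its Hessian are illustrative choices):

```python
import numpy as np

def newton_descent(grad_J, hess_J, a0, steps=20):
    """a(k+1) = a(k) - H^{-1} grad_J(a(k)); one step suffices on a quadratic."""
    a = np.asarray(a0, dtype=float)
    for _ in range(steps):
        a -= np.linalg.solve(hess_J(a), grad_J(a))  # solve H d = grad, avoid inverting H
    return a

grad_J = lambda a: np.array([2 * (a[0] - 3), 4 * (a[1] + 1)])
hess_J = lambda a: np.array([[2.0, 0.0], [0.0, 4.0]])
print(newton_descent(grad_J, hess_J, [0.0, 0.0]))  # -> [3, -1] after one step
```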

Page 15: CSC446: Pattern Recognition (LN8)

• Gradient Descent vs. Newton’s Descent

Unconstrained Optimization Methods

Page 16: CSC446: Pattern Recognition (LN8)

• Use one of the optimization techniques to find the parameters of the polynomial function that fits the following data (one possible approach is sketched after the table):

Assignment 5

x     F(x)
1     0.53
2     0.33
3     0.13
4    -0.38
5    -0.49
6    -0.23
7     0.44
8     0.85
9     0.37
10    0.12

[Figure: scatter plot of F(x) versus x]
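One hedged way to do this: fit, say, a cubic polynomial by minimizing the squared-error cost J(w) = 1/2 ||Xw – y||^2 (the degree 3 is an assumed choice; the assignment leaves the polynomial unspecified):

```python
import numpy as np

# Data from the assignment
x = np.arange(1, 11, dtype=float)
y = np.array([0.53, 0.33, 0.13, -0.38, -0.49, -0.23, 0.44, 0.85, 0.37, 0.12])

# Cubic model (assumed degree): design-matrix columns 1, x, x^2, x^3
X = np.vander(x, N=4, increasing=True)

# Minimize J(w) = 1/2 ||Xw - y||^2. Gradient descent on grad_J = X^T (Xw - y)
# would also work (with a small learning rate); here we jump straight to the
# unique minimizer, where the gradient vanishes: X^T X w = X^T y.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)      # fitted polynomial coefficients
print(X @ w)  # fitted values at the data points
```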

Page 17: CSC446: Pattern Recognition (LN8)

• Build a linear classifier to classify Fisher's Iris data set (a possible starting point is sketched below).

Assignment 6
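One hedged way to begin, assuming scikit-learn is available (the course does not prescribe a library, and the split below is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = Perceptron(max_iter=1000).fit(X_tr, y_tr)  # a linear machine over 3 classes
print("test accuracy:", clf.score(X_te, y_te))
```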

Page 18: CSC446: Pattern Recognition (LN8)

6-1. Introduction

6-2. Feedforward Operation and Classification

6-3. Backpropagation Algorithm

Multilayer Neural Networks

Page 19: CSC446: Pattern Recognition (LN8)

6-1. Introduction

• Linear classifiers are simple but not general enough for demanding applications; linear discriminant functions cannot separate nonlinearly separable data.

• The solution is to choose nonlinear discriminant functions, so that we may obtain arbitrary decision regions, in particular the ones leading to minimum error.

Multilayer Neural Networks

Page 20: CSC446: Pattern Recognition (LN8)

Feedforward Operation and Classification

– We need to learn the nonlinearity at the same time as the linear discriminant.

• This is the approach of multilayer neural networks (also called the Multilayer Perceptron, MLP).

– In multilayer neural networks, the form of the nonlinearity is learned from the training data.

Multilayer Neural Networks

Page 21: CSC446: Pattern Recognition (LN8)

Multilayer Neural Networks

• A single neuron is a nonlinear processing element. It computes:

netj = Σ(i=1..d) wji xi + wj0 = Σ(i=0..d) wji xi = wjTx   (taking x0 = 1)

oj = φ(netj)
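A single-neuron sketch in NumPy, using the logistic sigmoid as an assumed choice of φ (the slides leave φ generic):

```python
import numpy as np

def neuron(x, w, w0, phi=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """o_j = phi(net_j) with net_j = w^T x + w0; phi defaults to a sigmoid."""
    net = np.dot(w, x) + w0
    return phi(net)

print(neuron(np.array([0.5, -1.0]), np.array([2.0, 1.0]), 0.1))  # output in (0, 1)
```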

Page 22: CSC446: Pattern Recognition (LN8)

• An output unit oj computes the nonlinear function of its net activation, emitting:

oj = φ(netj)

• In the case of c outputs (classes), we can view the network as computing c discriminant functions oj = gj(x) and classify the input x according to the largest discriminant function gj(x), j = 1, …, c.

Multilayer Neural Networks

Page 23: CSC446: Pattern Recognition (LN8)

• Feedforward Neural Networks:

– A neural network consists of an input layer, one or more hidden layers, and an output layer, interconnected by modifiable weights represented by links between layers.

– Each layer consists of an array of neurons.

– A network with at least three layers is sufficient to learn any nonlinear mapping to an arbitrary degree of accuracy.

Multilayer Neural Networks

Page 24: CSC446: Pattern Recognition (LN8)

Multilayer Neural Networks

Page 25: CSC446: Pattern Recognition (LN8)

Multilayer Neural Networks

Page 26: CSC446: Pattern Recognition (LN8)

Multilayer Neural Networks

Page 27: CSC446: Pattern Recognition (LN8)

Multilayer Neural Networks

Page 28: CSC446: Pattern Recognition (LN8)

• Expressive Power of Multilayer Networks

Question: Can every decision be implemented by a three-layer network described by equation (1)?

Answer: Yes (due to A. Kolmogorov): "Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units nH, proper nonlinearities, and weights."

Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition.

Multilayer Neural Networks

Page 29: CSC446: Pattern Recognition (LN8)

• General Feedforward Operation – case of c output units:

– Hidden units enable us to express more complicated nonlinear functions and thus extend the classification.

– The activation function does not have to be a sign function; it is often required to be continuous and differentiable.

– We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit.

– We assume for now that all activation functions are identical:

gk(x) ≡ zk = f( Σ(j=1..nH) wkj f( Σ(i=1..d) wji xi + wj0 ) + wk0 ),   k = 1, …, c   (1)

Multilayer Neural Networks
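A direct NumPy transcription of equation (1): a three-layer forward pass (the layer sizes and random weights are placeholders, and tanh is an assumed choice of f):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_H, c = 4, 5, 3                   # input dim, hidden units, classes (assumed sizes)
W_ji = rng.normal(size=(n_H, d + 1))  # hidden weights; column 0 holds the biases w_j0
W_kj = rng.normal(size=(c, n_H + 1))  # output weights; column 0 holds the biases w_k0
f = np.tanh                           # assumed activation

def forward(x):
    """g_k(x) = f( sum_j w_kj f( sum_i w_ji x_i + w_j0 ) + w_k0 ), i.e. eq. (1)."""
    yj = f(W_ji @ np.concatenate(([1.0], x)))   # hidden activations
    zk = f(W_kj @ np.concatenate(([1.0], yj)))  # output activations
    return zk

x = rng.normal(size=d)
z = forward(x)
print(z, "-> class", int(np.argmax(z)))         # largest discriminant wins
```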

Page 30: CSC446: Pattern Recognition (LN8)

Multilayer Neural Networks

Page 31: CSC446: Pattern Recognition (LN8)

• Modes of operation:

– Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to obtain the outputs at the output units.

– Backward (Learning): the supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output.

Multilayer Neural Networks

Page 32: CSC446: Pattern Recognition (LN8)

• Network Learning:

– Let tk be the k-th target (or desired) output and zk be the k-th computed output, with k = 1, …, c, and let w represent all the weights of the network.

– The training error:

J(w) = (1/2) Σ(k=1..c) (tk – zk)² = (1/2) ||t – z||²

– The backpropagation learning rule is based on gradient descent:

• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error (a sketch follows this slide):

Δw = –η ∂J/∂w

Multilayer Neural Networks
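A compact backpropagation sketch deriving Δw = –η ∂J/∂w by the chain rule, for a small network with one hidden layer (sigmoid activations, the layer sizes, and the single made-up training pair are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_H, c, eta = 2, 3, 2, 0.5
W1 = rng.normal(scale=0.5, size=(n_H, d + 1))  # input-to-hidden (bias in column 0)
W2 = rng.normal(scale=0.5, size=(c, n_H + 1))  # hidden-to-output (bias in column 0)
sig = lambda t: 1.0 / (1.0 + np.exp(-t))       # f; note f'(net) = f(net)(1 - f(net))

x, t = np.array([0.3, -0.7]), np.array([1.0, 0.0])  # one training pair (made up)

for _ in range(1000):
    xb = np.concatenate(([1.0], x))
    y = sig(W1 @ xb)
    yb = np.concatenate(([1.0], y))
    z = sig(W2 @ yb)                      # forward pass
    # J(w) = 1/2 ||t - z||^2; output sensitivity: delta_k = (t_k - z_k) f'(net_k)
    dk = (t - z) * z * (1 - z)
    # hidden sensitivity: back-propagate dk through W2 (drop the bias column)
    dj = (W2[:, 1:].T @ dk) * y * (1 - y)
    W2 += eta * np.outer(dk, yb)          # delta_w = -eta dJ/dw = +eta * delta * input
    W1 += eta * np.outer(dj, xb)

print(z)  # approaches the target [1, 0]
```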

Page 33: CSC446: Pattern Recognition (LN8)

The Backpropagation Learning Algorithm

– Backpropagation is one of the simplest and most general methods for the supervised training of multilayer neural networks.

– It solves the credit assignment problem: there is no explicit teacher to state what each hidden unit's output should be.

– The power of backpropagation is that it allows us to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights.

Multilayer Neural Networks

Page 34: CSC446: Pattern Recognition (LN8)

Multilayer Neural Networks

Page 35: CSC446: Pattern Recognition (LN8)

• Build a neural network to classify Fisher's Iris data set (a possible starting point is sketched below).

Assignment 7 (optional)
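A hedged starting point, again assuming scikit-learn (any equivalent MLP implementation would do; the hidden-layer size is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
print("test accuracy:", net.score(X_te, y_te))
```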