CSC446: Pattern Recognition (LN8)
TRANSCRIPT
Lecture Note 8:
Ch5: Linear Discriminant Functions
Ch6: Multilayer Neural Networks
CSC446 : Pattern Recognition
Prof. Dr. Mostafa Gadal-Haqq
Faculty of Computer & Information Sciences
Computer Science Department
AIN SHAMS UNIVERSITY
Linear Discriminant Functions
1. Introduction
2. Linear Classifier & Decision Surface
2-1. The Two-Category Case
2-2. The Multi-Category Case
3. Unconstrained Optimization Methods:
3-1. The Gradient Descent Method
3-2. Newton's Descent Method
• In Bayesian decision theory:
– the underlying probability densities were known (or given); the training samples are used only to estimate the parameters of these densities.
• In linear discriminant functions:
– we only know the proper form of the discriminant functions (linear classifiers); we use the training samples to learn the values of the classifier's parameters.
– They may not be optimal, but they are very simple to implement.
• Definition:
– A linear discriminant function is a function that is a linear combination of the components of the feature vector x:
g(x) = w^t x + w_0   (1)
where w is the weight vector and w_0 is the bias or threshold weight.
• For a two-category classifier the decision rule is:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0;
or equivalently, decide ω1 if w^t x > -w_0 and ω2 otherwise.
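To make the decision rule concrete, here is a minimal Python sketch of a two-category linear classifier; the weight and bias values are hypothetical, chosen only for illustration.

```python
import numpy as np

# A minimal sketch of the two-category rule g(x) = w^t x + w_0,
# with an arbitrary weight vector and bias chosen for illustration.
w = np.array([2.0, -1.0])   # weight vector (hypothetical values)
w0 = -3.0                   # bias / threshold weight (hypothetical value)

def g(x):
    """Linear discriminant g(x) = w^t x + w_0."""
    return w @ x + w0

def classify(x):
    """Decide omega_1 if g(x) > 0, omega_2 otherwise."""
    return "omega_1" if g(x) > 0 else "omega_2"

print(classify(np.array([4.0, 1.0])))  # g = 2*4 - 1*1 - 3 = 4 > 0 -> omega_1
print(classify(np.array([1.0, 2.0])))  # g = 2*1 - 1*2 - 3 = -3 < 0 -> omega_2
```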
The Two-Category Case:
• The equation g(x) = 0 defines the decision surface that separates points of category ω1 from points of category ω2.
• When g(x) is linear, the decision surface is a hyperplane with the vector w normal to it.
• The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane: r = g(x) / ||w||.
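A short sketch of this distance computation, reusing the hypothetical weights from the previous sketch; dividing g(x) by ||w|| converts the raw discriminant value into a distance, since w is normal to the hyperplane but not necessarily of unit length.

```python
import numpy as np

w = np.array([2.0, -1.0])   # same hypothetical weights as above
w0 = -3.0

def signed_distance(x):
    """Algebraic distance from x to the hyperplane g(x) = 0:
    r = g(x) / ||w||. Positive on the omega_1 side, negative on omega_2."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([4.0, 1.0])
print(signed_distance(x))   # 4 / sqrt(5) ~ 1.789, on the positive side
```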
• In general, the hyperplane divides the feature space
into two half-spaces, decision region R1 for ω1 and
region R2 for ω2.
• It is sometimes said that any x in R1 is on the
positive side of the hyperplane, and any x in R2 is on
the negative side.
• The orientation of the surface is determined by the
normal vector w and the location of the surface is
determined by the bias w0.
The Multi-Category Case:
– We define c linear discriminant functions:
g_i(x) = w_i^t x + w_{i0},   i = 1, ..., c
– The decision rule: assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i.
• In this case, the classifier is a "linear machine".
– In case of ties, the classification is undefined.
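A minimal sketch of such a linear machine; the c = 3 weight vectors and biases are hypothetical values for illustration.

```python
import numpy as np

# A "linear machine": c linear discriminants g_i(x) = w_i^t x + w_i0,
# assign x to the class whose discriminant is largest. Weights are hypothetical.
W = np.array([[ 1.0,  0.5],     # w_1
              [-0.5,  1.0],     # w_2
              [ 0.0, -1.0]])    # w_3
w0 = np.array([0.0, 0.5, 1.0])  # biases w_10, w_20, w_30

def linear_machine(x):
    scores = W @ x + w0                 # g_i(x) for i = 1, ..., c
    return int(np.argmax(scores)) + 1   # index of the largest discriminant

print(linear_machine(np.array([2.0, 1.0])))  # scores [2.5, 0.5, 0.0] -> class 1
```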
The Multi-Category Case:
– A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant if x is in region R_i.
– The boundary that separates two contiguous regions R_i and R_j is a portion of the hyperplane H_ij defined by:
g_i(x) = g_j(x), i.e., (w_i - w_j)^t x + (w_{i0} - w_{j0}) = 0
– Thus (w_i - w_j) is normal to H_ij, and the distance from x to H_ij is:
d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||
Unconstrained Optimization Methods
• Consider a cost function J(w) that is a continuously differentiable function of the unknown weight vector w. We need to find an optimal solution w* that satisfies:
J(w*) ≤ J(w) for all w
• That is, solve the unconstrained optimization problem stated as:
"Minimize the cost function J(w) with respect to the weight vector w"
• The necessary condition for optimality:
∇J(w*) = 0
• The Gradient Descent Learning Method:
– Minimizing J(a) is done simply by starting with an arbitrary value for a, say a(1), then computing the gradient ∇J(a(1)). The next value, a(2), is obtained by moving a(1) in the direction of steepest descent, i.e., opposite to the gradient:
a(k+1) = a(k) - η(k) ∇J(a(k))
where η(k) is a positive learning rate that sets the step size.
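A toy sketch of gradient descent, assuming a simple quadratic cost J(a) = ||a - a_opt||^2 whose gradient is known in closed form; the starting point and learning rate are arbitrary choices.

```python
import numpy as np

# Gradient descent on the toy quadratic J(a) = ||a - a_opt||^2,
# whose gradient is grad J(a) = 2 (a - a_opt). All values are hypothetical.
a_opt = np.array([1.0, -2.0])

def grad_J(a):
    return 2.0 * (a - a_opt)

a = np.array([5.0, 5.0])         # a(1): arbitrary starting point
eta = 0.1                        # learning rate
for k in range(100):
    a = a - eta * grad_J(a)      # a(k+1) = a(k) - eta * grad J(a(k))

print(a)                         # converges toward a_opt = [1, -2]
```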
• Newton's Descent Learning Method:
– It uses the second-order expansion of J(a) around a(k):
J(a) ≈ J(a(k)) + ∇J(a(k))^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k))
– where H is the Hessian matrix of J(a) evaluated at a(k). Minimizing this quadratic model with respect to a gives the update a(k+1) = a(k) - H^{-1} ∇J(a(k)).
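A sketch of a Newton step on the same toy quadratic cost; because the cost is exactly quadratic, the Hessian is constant and a single Newton step lands on the minimum, which is what makes Newton's descent converge in fewer steps than gradient descent when H is available.

```python
import numpy as np

# Newton's descent on the same toy quadratic J(a) = ||a - a_opt||^2.
# Here the Hessian is constant, H = 2I, so one Newton step
# a(k+1) = a(k) - H^{-1} grad J(a(k)) reaches the exact minimum.
a_opt = np.array([1.0, -2.0])

def grad_J(a):
    return 2.0 * (a - a_opt)

H = 2.0 * np.eye(2)                      # Hessian of the quadratic cost

a = np.array([5.0, 5.0])                 # same starting point as before
a = a - np.linalg.solve(H, grad_J(a))    # one Newton step
print(a)                                 # [1, -2]: the optimum
```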
• Gradient Descent vs. Newton's Descent: [comparison figure omitted]
Assignment 5
• Use one of the optimization techniques to find the parameters of a polynomial function that fits the following data:
X     F(x)
1     0.53
2     0.33
3     0.13
4     -0.38
5     -0.49
6     -0.23
7     0.44
8     0.85
9     0.37
10    0.12
[Figure: scatter plot of the data; x axis 0.00 to 12.00, F(x) axis -0.80 to 0.60]
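One possible starting point for the assignment (an illustration, not the required solution): fitting an assumed third-degree polynomial to the data by gradient descent on the squared error. The degree, the input scaling, and the learning rate are all assumptions.

```python
import numpy as np

# Fit f(x) = c0 + c1*u + c2*u^2 + c3*u^3, with u = x/10, by gradient
# descent on the mean squared error. Degree 3 and eta are assumptions.
X = np.arange(1, 11, dtype=float)
F = np.array([0.53, 0.33, 0.13, -0.38, -0.49, -0.23, 0.44, 0.85, 0.37, 0.12])

A = np.vander(X / 10.0, 4, increasing=True)   # design matrix [1, u, u^2, u^3]
c = np.zeros(4)                                # initial coefficients
eta = 0.1                                      # learning rate
for _ in range(50000):
    r = A @ c - F                              # residuals
    c -= eta * (A.T @ r) / len(X)              # step along -grad J(c)

print(c)        # coefficients of the fitted (scaled) cubic
print(A @ c)    # fitted values, to compare against F(x)
```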
Assignment 6
• Build a linear classifier to classify Fisher's Iris data set.
Multilayer Neural Networks
6-1. Introduction
6-2. Feedforward Operation and Classification
6-3. Backpropagation Algorithm
6-1. Introduction
• Linear classifiers are simple but not general enough for demanding applications; in particular, linear discriminant functions cannot separate data that are not linearly separable.
• The solution is a proper choice of nonlinear functions: with them we may obtain arbitrary decision boundaries, in particular the one leading to minimum error.
Feedforward Operation and Classification
– We need to learn the nonlinearity at the same time as the linear discriminant.
• This is the approach of multilayer neural networks (also called multilayer perceptrons, MLPs).
– When using multilayer neural networks, the form of the nonlinearity is learned from the training data.
• A single neuron is a nonlinear processing element: it first computes its net activation as a linear combination of the inputs,
net_j = Σ_{i=1}^{d} w_{ji} x_i + w_{j0} = Σ_{i=0}^{d} w_{ji} x_i ≡ w_j^t x   (with the augmented input x_0 = 1)
and then emits the output
o_j = φ(net_j)
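A sketch of this single-neuron computation, assuming a logistic sigmoid for φ and hypothetical weight values.

```python
import numpy as np

# A single neuron: net_j = sum_i w_ji * x_i + w_j0, then o_j = phi(net_j).
# The logistic sigmoid is one common choice of phi; weights are hypothetical.
def phi(net):
    return 1.0 / (1.0 + np.exp(-net))   # logistic sigmoid

w_j = np.array([0.4, -0.6, 0.2])   # weights w_j1, w_j2, w_j3 (hypothetical)
w_j0 = 0.1                          # bias weight

x = np.array([1.0, 0.5, -1.0])
net_j = w_j @ x + w_j0              # net activation
o_j = phi(net_j)                    # neuron output
print(net_j, o_j)                   # 0.4 - 0.3 - 0.2 + 0.1 = 0.0 -> o_j = 0.5
```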
• An output unit computes a nonlinear function of its net activation, emitting:
o_j = φ(net_j)
• In the case of c outputs (classes), we can view the network as computing c discriminant functions o_j = g_j(x), and classify the input x according to the largest discriminant function g_j(x), j = 1, ..., c.
• Feedforward Neural Networks:
– A neural network consists of an input layer, one or more hidden layers, and an output layer, interconnected by modifiable weights represented by links between layers.
– Each layer consists of an array of neurons.
– At least a three-layer network is required to learn arbitrary nonlinear mappings to a given degree of accuracy.
• Expressive Power of Multilayer Networks
Question: Can every decision be implemented by a three-layer network described by equation (1)?
Answer: Yes (due to A. Kolmogorov):
"Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights."
Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition.
• General Feedforward Operation (case of c output units):
– Hidden units enable us to express more complicated nonlinear functions and thus extend the classification capabilities.
– The activation function does not have to be a sign function; it is often required to be continuous and differentiable.
– We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit.
– We assume for now that all activation functions are identical:
g_k(x) ≡ z_k = f( Σ_{j=1}^{n_H} w_{kj} f( Σ_{i=1}^{d} w_{ji} x_i + w_{j0} ) + w_{k0} ),   k = 1, ..., c   (1)
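A sketch of equation (1) in Python; the layer sizes, the random weights, and the choice of tanh as the activation f are illustrative assumptions.

```python
import numpy as np

# Feedforward operation of equation (1):
# g_k(x) = f( sum_j w_kj * f( sum_i w_ji * x_i + w_j0 ) + w_k0 ).
# Layer sizes and tanh are illustrative assumptions; weights are random.
rng = np.random.default_rng(0)
d, n_H, c = 4, 5, 3                 # inputs, hidden units, output classes

W_ji = rng.normal(size=(n_H, d))    # input-to-hidden weights
w_j0 = rng.normal(size=n_H)         # hidden biases
W_kj = rng.normal(size=(c, n_H))    # hidden-to-output weights
w_k0 = rng.normal(size=c)           # output biases

f = np.tanh                         # same activation at both layers

def g(x):
    y = f(W_ji @ x + w_j0)          # hidden-unit outputs y_j
    z = f(W_kj @ y + w_k0)          # output-unit values z_k = g_k(x)
    return z

x = rng.normal(size=d)
z = g(x)
print(z, "-> class", int(np.argmax(z)) + 1)   # largest discriminant wins
```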
• Modes of Operation:
– Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to produce values at the output units.
– Backward (Learning): supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed outputs and the desired outputs.
• Network Learning:
– Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, ..., c, and let w represent all the weights of the network.
– The training error:
J(w) = (1/2) Σ_{k=1}^{c} (t_k - z_k)^2 = (1/2) ||t - z||^2
– The backpropagation learning rule is based on gradient descent:
• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:
Δw = -η ∂J/∂w
where η is the learning rate.
The Backpropagation Learning Algorithm
– Backpropagation is one of the simplest and most general methods for supervised training of multilayer neural networks.
– It solves the credit assignment problem: there is no explicit teacher to state what each hidden unit's output should be.
– The power of backpropagation is that it allows us to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights.
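A sketch of one backpropagation step for the three-layer network of equation (1), minimizing J(w) by gradient descent; the layer sizes, the target vector, and the learning rate are assumptions. The delta_h line computes the "effective error" for the hidden units mentioned above.

```python
import numpy as np

# One backpropagation step for a three-layer tanh network, minimizing
# J(w) = 1/2 * sum_k (t_k - z_k)^2. Uses f'(net) = 1 - tanh(net)^2.
# Sizes, targets, and eta are illustrative assumptions.
rng = np.random.default_rng(1)
d, n_H, c, eta = 4, 5, 3, 0.1

W1 = rng.normal(scale=0.5, size=(n_H, d)); b1 = np.zeros(n_H)
W2 = rng.normal(scale=0.5, size=(c, n_H)); b2 = np.zeros(c)

x = rng.normal(size=d)
t = np.array([1.0, -1.0, -1.0])     # target vector for the true class

# Forward pass
net_h = W1 @ x + b1;  y = np.tanh(net_h)    # hidden-unit outputs
net_o = W2 @ y + b2;  z = np.tanh(net_o)    # output-unit values

# Backward pass: output sensitivities, then credit assigned to hidden units
delta_o = (t - z) * (1.0 - z**2)            # output error * f'(net_o)
delta_h = (W2.T @ delta_o) * (1.0 - y**2)   # effective error of hidden units

# Weight updates: delta_w = -eta * dJ/dw
W2 += eta * np.outer(delta_o, y);  b2 += eta * delta_o
W1 += eta * np.outer(delta_h, x);  b1 += eta * delta_h
```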
Assignment 7 (Optional)
• Build a neural network to classify Fisher's Iris data set.