CSC446: Pattern Recognition (LN8)
TRANSCRIPT
Lecture Note 8:
Ch5: Linear Discriminant Functions
Ch6: Multilayer Neural Networks
CSC446 : Pattern Recognition
Prof. Dr. Mostafa Gadal-Haqq
Faculty of Computer & Information Sciences
Computer Science Department
AIN SHAMS UNIVERSITY
Linear Discriminant Functions
1. Introduction
2. Linear Classifier & Decision Surface
2-1. The Two-Category Case
2-2. The Multi-Category Case
3. Unconstrained Optimization Methods:
3-1. The Gradient Descent Method
3-2. Newton's Descent Method
• In Bayesian decision theory:
– the underlying probability densities were known (or given); the training samples are used only to estimate the parameters of these densities.
• In linear discriminant functions:
– we only know the proper form of the discriminant functions (linear classifiers); we use the training samples to learn the values of the classifier's parameters.
– They may not be optimal, but they are very simple to implement.
• Definition:
– A linear discriminant function is a function that is a linear combination of the components of the feature vector x:
g(x) = w^t x + w_0   (1)
where w is the weight vector and w_0 is the bias or threshold weight.
• For a two-category classifier the decision rule is:
Decide ω1 if g(x) > 0 and ω2 if g(x) < 0;
or equivalently, decide ω1 if w^t x > -w_0 and ω2 otherwise.
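To make the decision rule concrete, here is a minimal Python sketch of a two-category linear classifier; the weight and bias values are hypothetical, chosen only for illustration.

```python
import numpy as np

# A minimal sketch of the two-category rule g(x) = w^t x + w_0,
# with an arbitrary weight vector and bias chosen for illustration.
w = np.array([2.0, -1.0])   # weight vector (hypothetical values)
w0 = -3.0                   # bias / threshold weight (hypothetical value)

def g(x):
    """Linear discriminant g(x) = w^t x + w_0."""
    return w @ x + w0

def classify(x):
    """Decide omega_1 if g(x) > 0, omega_2 otherwise."""
    return "omega_1" if g(x) > 0 else "omega_2"

print(classify(np.array([4.0, 1.0])))  # g = 2*4 - 1*1 - 3 = 4 > 0 -> omega_1
print(classify(np.array([1.0, 2.0])))  # g = 2*1 - 1*2 - 3 = -3 < 0 -> omega_2
```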
The Two-Category Case:
• The equation g(x) = 0 defines the decision surface that separates points of category ω1 from points of category ω2.
• When g(x) is linear, the decision surface is a hyperplane with the vector w normal to it.
• The discriminant function g(x) gives an algebraic measure of the distance from x to the hyperplane: r = g(x) / ||w||.
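A short sketch of this distance computation, reusing the hypothetical weights from the previous sketch; dividing g(x) by ||w|| converts the raw discriminant value into a distance, since w is normal to the hyperplane but not necessarily of unit length.

```python
import numpy as np

w = np.array([2.0, -1.0])   # same hypothetical weights as above
w0 = -3.0

def signed_distance(x):
    """Algebraic distance from x to the hyperplane g(x) = 0:
    r = g(x) / ||w||. Positive on the omega_1 side, negative on omega_2."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([4.0, 1.0])
print(signed_distance(x))   # 4 / sqrt(5) ~ 1.789, on the positive side
```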
• In general, the hyperplane divides the feature space
into two half-spaces, decision region R1 for ω1 and
region R2 for ω2.
• It is sometimes said that any x in R1 is on the
positive side of the hyperplane, and any x in R2 is on
the negative side.
• The orientation of the surface is determined by the
normal vector w and the location of the surface is
determined by the bias w0.
The Multi-Category Case:
– We define c linear discriminant functions:
g_i(x) = w_i^t x + w_{i0},   i = 1, ..., c
– The decision rule: assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i.
• In this case, the classifier is a "linear machine".
– In case of ties, the classification is undefined.
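A minimal sketch of such a linear machine; the c = 3 weight vectors and biases are hypothetical values for illustration.

```python
import numpy as np

# A "linear machine": c linear discriminants g_i(x) = w_i^t x + w_i0,
# assign x to the class whose discriminant is largest. Weights are hypothetical.
W = np.array([[ 1.0,  0.5],     # w_1
              [-0.5,  1.0],     # w_2
              [ 0.0, -1.0]])    # w_3
w0 = np.array([0.0, 0.5, 1.0])  # biases w_10, w_20, w_30

def linear_machine(x):
    scores = W @ x + w0                 # g_i(x) for i = 1, ..., c
    return int(np.argmax(scores)) + 1   # index of the largest discriminant

print(linear_machine(np.array([2.0, 1.0])))  # scores [2.5, 0.5, 0.0] -> class 1
```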
The Multi-Category Case:
– A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant if x is in region R_i.
– The boundary that separates two contiguous regions R_i and R_j is a portion of the hyperplane H_ij defined by:
g_i(x) = g_j(x), i.e., (w_i - w_j)^t x + (w_{i0} - w_{j0}) = 0
– Thus (w_i - w_j) is normal to H_ij, and the distance from x to H_ij is:
d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||
Unconstrained Optimization Methods
• Consider a cost function J(w) that is a continuously differentiable function of the unknown weight vector w. We need to find an optimal solution w* that satisfies:
J(w*) ≤ J(w) for all w
• That is, solve the unconstrained optimization problem stated as:
"Minimize the cost function J(w) with respect to the weight vector w"
• The necessary condition for optimality:
∇J(w*) = 0
• The Gradient Descent Learning Method:
– Minimizing J(a) is done simply by starting with an arbitrary value for a, say a(1), then computing the gradient ∇J(a(1)). The next value, a(2), is obtained by moving a(1) in the direction of steepest descent, i.e., opposite to the gradient:
a(k+1) = a(k) - η(k) ∇J(a(k))
where η(k) is a positive learning rate that sets the step size.
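A toy sketch of gradient descent, assuming a simple quadratic cost J(a) = ||a - a_opt||^2 whose gradient is known in closed form; the starting point and learning rate are arbitrary choices.

```python
import numpy as np

# Gradient descent on the toy quadratic J(a) = ||a - a_opt||^2,
# whose gradient is grad J(a) = 2 (a - a_opt). All values are hypothetical.
a_opt = np.array([1.0, -2.0])

def grad_J(a):
    return 2.0 * (a - a_opt)

a = np.array([5.0, 5.0])         # a(1): arbitrary starting point
eta = 0.1                        # learning rate
for k in range(100):
    a = a - eta * grad_J(a)      # a(k+1) = a(k) - eta * grad J(a(k))

print(a)                         # converges toward a_opt = [1, -2]
```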
• Newton's Descent Learning Method:
– It uses the second-order expansion of J(a) around a(k):
J(a) ≈ J(a(k)) + ∇J(a(k))^t (a - a(k)) + (1/2)(a - a(k))^t H (a - a(k))
– where H is the Hessian matrix of J(a) evaluated at a(k). Minimizing this quadratic model with respect to a gives the update a(k+1) = a(k) - H^{-1} ∇J(a(k)).
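A sketch of a Newton step on the same toy quadratic cost; because the cost is exactly quadratic, the Hessian is constant and a single Newton step lands on the minimum, which is what makes Newton's descent converge in fewer steps than gradient descent when H is available.

```python
import numpy as np

# Newton's descent on the same toy quadratic J(a) = ||a - a_opt||^2.
# Here the Hessian is constant, H = 2I, so one Newton step
# a(k+1) = a(k) - H^{-1} grad J(a(k)) reaches the exact minimum.
a_opt = np.array([1.0, -2.0])

def grad_J(a):
    return 2.0 * (a - a_opt)

H = 2.0 * np.eye(2)                      # Hessian of the quadratic cost

a = np.array([5.0, 5.0])                 # same starting point as before
a = a - np.linalg.solve(H, grad_J(a))    # one Newton step
print(a)                                 # [1, -2]: the optimum
```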
• Gradient Descent vs. Newton's Descent: [comparison figure omitted]
Assignment 5
• Use one of the optimization techniques to find the parameters of a polynomial function that fits the following data:
X     F(x)
1     0.53
2     0.33
3     0.13
4     -0.38
5     -0.49
6     -0.23
7     0.44
8     0.85
9     0.37
10    0.12
[Figure: scatter plot of the data; x axis 0.00 to 12.00, F(x) axis -0.80 to 0.60]
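One possible starting point for the assignment (an illustration, not the required solution): fitting an assumed third-degree polynomial to the data by gradient descent on the squared error. The degree, the input scaling, and the learning rate are all assumptions.

```python
import numpy as np

# Fit f(x) = c0 + c1*u + c2*u^2 + c3*u^3, with u = x/10, by gradient
# descent on the mean squared error. Degree 3 and eta are assumptions.
X = np.arange(1, 11, dtype=float)
F = np.array([0.53, 0.33, 0.13, -0.38, -0.49, -0.23, 0.44, 0.85, 0.37, 0.12])

A = np.vander(X / 10.0, 4, increasing=True)   # design matrix [1, u, u^2, u^3]
c = np.zeros(4)                                # initial coefficients
eta = 0.1                                      # learning rate
for _ in range(50000):
    r = A @ c - F                              # residuals
    c -= eta * (A.T @ r) / len(X)              # step along -grad J(c)

print(c)        # coefficients of the fitted (scaled) cubic
print(A @ c)    # fitted values, to compare against F(x)
```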
Assignment 6
• Build a linear classifier to classify Fisher's Iris data set.
Multilayer Neural Networks
6-1. Introduction
6-2. Feedforward Operation and Classification
6-3. Backpropagation Algorithm
6-1. Introduction
• Linear classifiers are simple but not general enough for demanding applications; in particular, linear discriminant functions cannot separate data that are not linearly separable.
• The solution is a proper choice of nonlinear functions: with them we may obtain arbitrary decision boundaries, in particular the one leading to minimum error.
Feedforward Operation and Classification
– We need to learn the nonlinearity at the same time as the linear discriminant.
• This is the approach of multilayer neural networks (also called multilayer perceptrons, MLPs).
– When using multilayer neural networks, the form of the nonlinearity is learned from the training data.
• A single neuron is a nonlinear processing element: it first computes its net activation as a linear combination of the inputs,
net_j = Σ_{i=1}^{d} w_{ji} x_i + w_{j0} = Σ_{i=0}^{d} w_{ji} x_i ≡ w_j^t x   (with the augmented input x_0 = 1)
and then emits the output
o_j = φ(net_j)
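A sketch of this single-neuron computation, assuming a logistic sigmoid for φ and hypothetical weight values.

```python
import numpy as np

# A single neuron: net_j = sum_i w_ji * x_i + w_j0, then o_j = phi(net_j).
# The logistic sigmoid is one common choice of phi; weights are hypothetical.
def phi(net):
    return 1.0 / (1.0 + np.exp(-net))   # logistic sigmoid

w_j = np.array([0.4, -0.6, 0.2])   # weights w_j1, w_j2, w_j3 (hypothetical)
w_j0 = 0.1                          # bias weight

x = np.array([1.0, 0.5, -1.0])
net_j = w_j @ x + w_j0              # net activation
o_j = phi(net_j)                    # neuron output
print(net_j, o_j)                   # 0.4 - 0.3 - 0.2 + 0.1 = 0.0 -> o_j = 0.5
```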
• An output unit computes a nonlinear function of its net activation, emitting:
o_j = φ(net_j)
• In the case of c outputs (classes), we can view the network as computing c discriminant functions o_j = g_j(x), and classify the input x according to the largest discriminant function g_j(x), j = 1, ..., c.
• Feedforward Neural Networks:
– A neural network consists of an input layer, one or more hidden layers, and an output layer, interconnected by modifiable weights represented by links between layers.
– Each layer consists of an array of neurons.
– At least a three-layer network is required to learn arbitrary nonlinear mappings to a given degree of accuracy.
• Expressive Power of Multilayer Networks
Question: Can every decision be implemented by a three-layer network described by equation (1)?
Answer: Yes (due to A. Kolmogorov):
"Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights."
Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition.
• General Feedforward Operation (case of c output units):
– Hidden units enable us to express more complicated nonlinear functions and thus extend the classification capabilities.
– The activation function does not have to be a sign function; it is often required to be continuous and differentiable.
– We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit.
– We assume for now that all activation functions are identical:
g_k(x) ≡ z_k = f( Σ_{j=1}^{n_H} w_{kj} f( Σ_{i=1}^{d} w_{ji} x_i + w_{j0} ) + w_{k0} ),   k = 1, ..., c   (1)
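A sketch of equation (1) in Python; the layer sizes, the random weights, and the choice of tanh as the activation f are illustrative assumptions.

```python
import numpy as np

# Feedforward operation of equation (1):
# g_k(x) = f( sum_j w_kj * f( sum_i w_ji * x_i + w_j0 ) + w_k0 ).
# Layer sizes and tanh are illustrative assumptions; weights are random.
rng = np.random.default_rng(0)
d, n_H, c = 4, 5, 3                 # inputs, hidden units, output classes

W_ji = rng.normal(size=(n_H, d))    # input-to-hidden weights
w_j0 = rng.normal(size=n_H)         # hidden biases
W_kj = rng.normal(size=(c, n_H))    # hidden-to-output weights
w_k0 = rng.normal(size=c)           # output biases

f = np.tanh                         # same activation at both layers

def g(x):
    y = f(W_ji @ x + w_j0)          # hidden-unit outputs y_j
    z = f(W_kj @ y + w_k0)          # output-unit values z_k = g_k(x)
    return z

x = rng.normal(size=d)
z = g(x)
print(z, "-> class", int(np.argmax(z)) + 1)   # largest discriminant wins
```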
• Modes of Operation:
– Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to produce values at the output units.
– Backward (Learning): supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed outputs and the desired outputs.
• Network Learning:
– Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, ..., c, and let w represent all the weights of the network.
– The training error:
J(w) = (1/2) Σ_{k=1}^{c} (t_k - z_k)^2 = (1/2) ||t - z||^2
– The backpropagation learning rule is based on gradient descent:
• The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:
Δw = -η ∂J/∂w
where η is the learning rate.
The Backpropagation Learning Algorithm
– Backpropagation is one of the simplest and most general methods for supervised training of multilayer neural networks.
– It solves the credit assignment problem: there is no explicit teacher to state what each hidden unit's output should be.
– The power of backpropagation is that it allows us to calculate an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights.
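A sketch of one backpropagation step for the three-layer network of equation (1), minimizing J(w) by gradient descent; the layer sizes, the target vector, and the learning rate are assumptions. The delta_h line computes the "effective error" for the hidden units mentioned above.

```python
import numpy as np

# One backpropagation step for a three-layer tanh network, minimizing
# J(w) = 1/2 * sum_k (t_k - z_k)^2. Uses f'(net) = 1 - tanh(net)^2.
# Sizes, targets, and eta are illustrative assumptions.
rng = np.random.default_rng(1)
d, n_H, c, eta = 4, 5, 3, 0.1

W1 = rng.normal(scale=0.5, size=(n_H, d)); b1 = np.zeros(n_H)
W2 = rng.normal(scale=0.5, size=(c, n_H)); b2 = np.zeros(c)

x = rng.normal(size=d)
t = np.array([1.0, -1.0, -1.0])     # target vector for the true class

# Forward pass
net_h = W1 @ x + b1;  y = np.tanh(net_h)    # hidden-unit outputs
net_o = W2 @ y + b2;  z = np.tanh(net_o)    # output-unit values

# Backward pass: output sensitivities, then credit assigned to hidden units
delta_o = (t - z) * (1.0 - z**2)            # output error * f'(net_o)
delta_h = (W2.T @ delta_o) * (1.0 - y**2)   # effective error of hidden units

# Weight updates: delta_w = -eta * dJ/dw
W2 += eta * np.outer(delta_o, y);  b2 += eta * delta_o
W1 += eta * np.outer(delta_h, x);  b1 += eta * delta_h
```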
Assignment 7 (Optional)
• Build a neural network to classify Fisher's Iris data set.