nns2_basics.ppt
TRANSCRIPT
-
Lecture 2: Basics and definitions
Networks as Data Models
-
Last lecture: an artificial neuron

$$y_i = f\!\left(\sum_{j=1}^{m} w_{ij} x_j + b_i\right) \qquad (1)$$

[Figure: an artificial neuron with inputs $x_1, x_2, \dots, x_m$, weights $w_{i1}, w_{i2}, \dots, w_{im}$, a bias $b_i$ (implemented as a weight on a fixed input $x_0 = 1$), and output $y_i$.]
-
Thus the artificial neuron is defined by the components:
1. A set of inputs, $x_j$.
2. A set of weights, $w_{ij}$.
3. A bias, $b_i$.
4. An activation function, $f$.
5. The neuron output, $y_i$.
The subscript $j$ indicates the $j$-th input and its weight; $i$ indexes the neuron.
As the inputs and output are external, the parameters of this model are the weights, the bias and the activation function, and these therefore DEFINE the model
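A minimal sketch of these components in code (illustrative only; the tanh activation is one common choice, not fixed by the lecture):

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Artificial neuron of equation (1): y_i = f(sum_j w_ij * x_j + b_i).

    x: inputs (x_1..x_m), w: weights (w_i1..w_im), b: bias b_i,
    f: activation function (tanh is just one common choice).
    """
    return f(np.dot(w, x) + b)

# Example: a 2-input neuron
y = neuron(np.array([0.5, -1.0]), np.array([0.2, 0.7]), b=0.1)
```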
-
[Figure: a multi-layer network with inputs $x_1, x_2, \dots, x_n$ (visual input), hidden layers, and outputs (motor output).]

More layers means more of the same parameters (and several subscripts)
-
Network as a data model
Can view a network as a model which has a set of parameters associated with it
Networks transform input data into an output
The transformation is defined by the network parameters
Parameters are set/adapted by an optimisation/adaptive procedure: learning (Haykin, 99)
The idea is that, given a set of data points, the network (model) can be trained so as to generalise
-
NNs for function approximation
that is, the network learns a (correct) mapping from inputs to outputs
Thus NNs can be seen as a multivariate non-linear mapping and are often used for function approximation
2 main categories:
Classification: given an input, say which class it is in
Regression: given an input, what is the expected output
-
Mapping/function needs to be learnt: various methods available
Learning process used shapes final solution
Supervised learning: have a teacher, telling you where to go
Unsupervised learning: no teacher, net learns by itself
Reinforcement learning: have a critic, wrong or correct
Type of learning used depends on task at hand. We will deal mainly
with supervised and unsupervised learning. Reinforcement learning
will be taught in the Adaptive Systems course or can be found in e.g. Haykin or Hertz et al., or:
Sutton R.S., and Barto A.G. (1998): Reinforcement Learning: An Introduction, MIT Press
LEARNING: extracting principles from data
-
Pattern: the opposite of chaos; it is an entity, vaguely defined, that could be given a name or a classification
Examples:
Fingerprints
Handwritten characters
Human faces
Speech (or deer/whale/bat etc.) signals
Iris patterns
Medical imaging (various screening procedures)
Remote sensing, etc.
Pattern recognition
-
Given a pattern:
a. supervised classification (discriminant analysis), in which the input pattern is identified as a member of a predefined class
b. unsupervised classification (e.g. clustering), in which the pattern is assigned to a hitherto unknown class
Unsupervised methods will be discussed further in future lectures
-
E.g. handwritten character classification:
First we need a data set to learn from: sets of characters
How are they represented? E.g. as an input vector $\mathbf{x} = (x_1, \dots, x_n)$ to the network (e.g. a vector of ones and zeroes for each pixel according to whether it is black/white)
The set of input vectors is our Training Set X, which has already been classified into 'a's and 'b's (note: capitals for the set, X; underlined small letters for an instance of the set, $\mathbf{x}_i$, i.e. the i-th training pattern/vector)
Given a training set X, our goal is to tell if a new image is an 'a' or a 'b', i.e. classify it into one of 2 classes, C1 (all 'a's) or C2 (all 'b's) (in general, one of k classes $C_1, \dots, C_k$)
[Figure: example handwritten 'a' and 'b' characters.]
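A minimal sketch of this pixel representation (the 16 x 16 image size and names are our assumptions for illustration):

```python
import numpy as np

def image_to_input_vector(image):
    """Flatten a 2-D binary image (1 = black, 0 = white) into the
    input vector x = (x_1, ..., x_n) fed to the network."""
    return np.asarray(image, dtype=float).ravel()

# Toy example: a random 16 x 16 black/white 'image', so n = 256.
image = np.random.randint(0, 2, size=(16, 16))
x = image_to_input_vector(image)
```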
-
Generalisation
Q. How do we tell if a new unseen image is an 'a' or a 'b'?
A. Brute force: have a library of all possible images
But 256 x 256 binary pixels => $2^{256 \times 256} \approx 10^{19,728}$ images
Impossible! Typically we have less than a few thousand images in the training set
Therefore, the system must be able to classify UNSEEN patterns from the patterns it has seen
I.e. it must be able to generalise from the data in the training set
Intuition: real neural networks do this well, so maybe artificial ones can do the same. As they are also shaped by experience, maybe we'll also learn about how the brain does it ...
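The arithmetic behind this count, for reference:

$$\log_{10} 2^{65536} = 65536 \times \log_{10} 2 \approx 65536 \times 0.30103 \approx 19728, \quad\text{hence}\quad 2^{256 \times 256} \approx 10^{19,728}.$$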
-
For 2-class classification we want the network output y (a function of the inputs and network parameters) to be:
$y(\mathbf{x}, \mathbf{w}) = 1$ if $\mathbf{x}$ is an 'a'
$y(\mathbf{x}, \mathbf{w}) = -1$ if $\mathbf{x}$ is a 'b'
where $\mathbf{x}$ is an input vector and the network parameters are grouped as a vector $\mathbf{w}$.
y is known as a discriminant function: it discriminates between the 2 classes
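An illustrative sketch of such a discriminant (the linear form w.x + b is our assumption; the lecture only requires an output of +1 or -1):

```python
import numpy as np

def discriminant(x, w, b):
    """y(x, w) = +1 for class C1 ('a'), -1 for class C2 ('b').

    Here a simple linear discriminant thresholded at zero; any
    parameterised function of x and w would fit the definition.
    """
    return 1 if np.dot(w, x) + b >= 0 else -1
```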
-
As the network mapping is defined by the parameters, we must use the data set to perform Learning (training, adaptation), i.e.:
change the weights or the interaction between neurons according to the training examples (and possibly prior knowledge of the problem)
The purpose of learning is to minimise:
training errors on the learning data: the learning error
prediction errors on new, unseen data: the generalisation error
since when the errors are minimised, the network discriminates between the 2 classes
We therefore need an error function to measure the network performance based on the training error
An optimisation algorithm can then be used to minimise the learning errors and train the network
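A minimal sketch of one common error function, the sum-of-squares training error (the lecture does not commit to a particular choice):

```python
import numpy as np

def training_error(y_pred, y_true):
    """Sum-of-squares error over the training set: E = 0.5 * sum (y - t)^2.

    An optimisation algorithm (e.g. gradient descent) would adjust the
    weights to reduce this quantity.
    """
    return 0.5 * np.sum((np.asarray(y_pred) - np.asarray(y_true)) ** 2)
```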
-
Feature Extraction
However, if we use all the pixels as inputs we are going to have a long training procedure and a very big network
May want to analyse the data first (pre-process it) and extract some (lower-dimensional) salient features to be the inputs to the network
[Figure: feature extraction maps the pattern space (data) x to the feature space x*.]
Could use the ratio of height and width of a letter as a feature, as 'b's will tend to be higher than 'a's (prior knowledge). This feature is also scale-invariant
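A minimal sketch of this feature extractor (our own illustrative code, assuming a binary image with 1 = black):

```python
import numpy as np

def height_width_ratio(image):
    """Extract a single scale-invariant feature x*: the height/width
    ratio of the ink's bounding box ('b's tend to score higher than 'a's)."""
    rows, cols = np.nonzero(image)       # coordinates of black pixels
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return height / width
```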
-
Could then make a decision based on this feature. Suppose we make a histogram of the values of x* for the input vectors in the training set X
[Figure: histograms of x* for the two classes C1 and C2, with a new input value A marked.]
For a new input with an x* value of A, we would classify it as C1 as it is more likely to belong to this class
-
Therefore, we get the idea of a Decision Boundary
Points on one side of the boundary are in one class, and points on the other side are in the other class, i.e.:
if x* < d the pattern is in C1, else it is in C2
Intuitively it makes sense (and is optimal in a Bayesian sense) to place the boundary where the 2 histograms cross
[Figure: the two class histograms with the decision boundary x* = d placed where they cross.]
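A minimal sketch of the threshold classifier, with a rough way of placing d where the two class histograms cross (the search strategy is our own construction; it assumes 1-D feature arrays f1, f2 for the two classes):

```python
import numpy as np

def classify(x_star, d):
    """Decision rule from the slide: C1 if x* < d, else C2."""
    return "C1" if x_star < d else "C2"

def crossing_threshold(f1, f2, bins=50):
    """Place d near where the two histograms cross, searching only
    between the class means to avoid empty tail bins."""
    lo = min(f1.min(), f2.min())
    hi = max(f1.max(), f2.max())
    h1, edges = np.histogram(f1, bins=bins, range=(lo, hi))
    h2, _ = np.histogram(f2, bins=bins, range=(lo, hi))
    centres = 0.5 * (edges[:-1] + edges[1:])
    inside = (centres > min(f1.mean(), f2.mean())) & \
             (centres < max(f1.mean(), f2.mean()))
    diffs = np.abs(h1 - h2).astype(float)
    diffs[~inside] = np.inf              # ignore bins outside the overlap region
    return centres[np.argmin(diffs)]
```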
-
Can then view pattern recognition as the process of assigning patterns
to one of a number of classes by dividing up the feature space with
decision boundaries, which thus divides the original space
[Figure: pattern space (data) x -> feature extraction -> feature space -> classification -> decision space y.]
-
However, there can be lots of overlap in this case, so could use a rejection threshold e where:
if x* < d - e the pattern is in C1
if x* > d + e the pattern is in C2
else refer to a better/different classifier (see the sketch below)
Related to the idea of minimising Risk, where it may be more important not to misclassify in one class than in the other
Especially important in medical applications. Can serve to shift the decision boundary one way or the other based on the Loss function, which defines the relative importance/cost of the different errors
[Figure: the overlapping class histograms; a value A falling in the rejection region d - e < x* < d + e is left unclassified.]
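A minimal sketch of the rejection rule above (names are our own):

```python
def classify_with_reject(x_star, d, e):
    """Threshold classifier with a rejection band of half-width e around d:
    inputs in the ambiguous region [d - e, d + e] are referred on rather
    than force-classified."""
    if x_star < d - e:
        return "C1"
    if x_star > d + e:
        return "C2"
    return "reject"  # pass to a better/different classifier
```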
-
Alternatively, can use more features
[Figure: training data scattered in the (x1*, x2*) feature plane. Use of any one feature alone leads to significant overlap (imagine projecting the points onto either axis), but use of both gives a good separation.]
However, cannot keep increasing the number of features, as there will come a point where performance starts to degrade because there is not enough data to provide a good estimate (cf. using 256 x 256 pixels)
-
Curse of dimensionality
Geometric e.g.: suppose we want to approximate a 1-d function y from m-dimensional training data. We could:
divide each dimension into intervals (like a histogram)
y value for an interval = mean y value of all points in the interval
Increase precision by increasing the number of intervals
However, we need at least 1 point in each interval: for k intervals in each dimension we need > $k^m$ data points (worked numbers below)
Thus the number of data points grows at least exponentially with the input dimension
Known as the Curse of Dimensionality:
"A function defined in high dimensional space is likely to be much more complex than a function defined in a lower dimensional space and those complications are harder to discern" (Friedman 95, in Haykin, 99)
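As a quick worked illustration (our numbers, not the lecture's): with $k = 10$ intervals per dimension,

$$m = 1:\ 10^{1} = 10 \text{ points}, \qquad m = 3:\ 10^{3} = 1000, \qquad m = 10:\ 10^{10} = 10\,000\,000\,000 \text{ points}.$$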
-
Of course, the above is a particularly inefficient way of using data, and most NNs are less susceptible
However, the only practical way to beat the curse is to incorporate correct prior knowledge
In practice, we must make the underlying function smoother (i.e. less complex) with increasing input dimensionality
Also try to reduce the input dimension by pre-processing
Mainly, learn to live with the fact that perfect performance is not possible: data in the real world sometimes overlaps. Treat input data as random variables and instead look for the model with the smallest probability of making a mistake
-
Multivariate regression
A type of function approximation: try to approximate a function from a set of (noisy) training data. E.g. suppose we have the function:
$y = 0.5 + 0.4 \sin(2\pi x)$
We generate training data at equal intervals of x and add a little random Gaussian noise with s.d. 0.05
We add noise since in practical applications data will inevitably be noisy
We then test the model by plugging in many values of x and viewing the resultant function. This gives an idea of the Generalisation performance of the model
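A minimal sketch of this data-generation step (the number of points, 25, is our choice; the lecture does not specify it):

```python
import numpy as np

# Training data as on the slide: y = 0.5 + 0.4*sin(2*pi*x), sampled at
# equal intervals of x, plus Gaussian noise with s.d. 0.05.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
y = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, x.size)
```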
-
E.g.: suppose we have the function $y = 0.5 + 0.4 \sin(2\pi x)$
We generate training data at equal intervals of x (red circles) and add a little random Gaussian noise with s.d. 0.05. The model is trained on this data
[Figure: the noisy training points (red circles) sampled from the sine curve.]
-
We then test the model (in this case a piecewise linear model) by plugging in many values of x and viewing the resultant function (solid blue line)
This gives an idea of the Generalisation performance of the model
[Figure: the trained piecewise linear model (solid blue line) through the training points.]
-
Model Complexity
In the previous picture we used a piecewise linear function to approximate the data. Better to use a polynomial $y = \sum_i a_i x^i$ to approximate the data, i.e.:
$y = a_0 + a_1 x$  (1st order: straight line)
$y = a_0 + a_1 x + a_2 x^2$  (2nd order: quadratic)
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$  (3rd order)
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_n x^n$  (nth order)
As the order (highest power of x) increases, so does the potential complexity of the model/polynomial
This means that it can represent a more complex (non-smooth) function and thus approximate the data more accurately
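A minimal sketch of fitting polynomials of increasing order to the noisy sine data (numpy's least-squares polyfit stands in for whatever fitting procedure the lecture has in mind; the orders match the next slide):

```python
import numpy as np

# Noisy samples of y = 0.5 + 0.4*sin(2*pi*x), as in the earlier sketch.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
y = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, x.size)

# Training error falls as the order (model complexity) rises.
for order in (1, 3, 10):
    coeffs = np.polyfit(x, y, deg=order)        # least-squares coefficients a_0..a_n
    y_fit = np.polyval(coeffs, x)               # model output at the training inputs
    train_err = 0.5 * np.sum((y_fit - y) ** 2)  # sum-of-squares training error
    print(f"order {order:2d}: training error = {train_err:.5f}")
```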
-
[Figure: polynomial fits of order 1, 3 and 10 to the training data.]
1st order: model too simple
10th order: more accurate in terms of passing through the data points, but is too complex and non-smooth (curvy)
3rd order: models the underlying function well
-
Note though that the training error continues to go down as the model matches the fine-scale detail of the data (i.e. the noise)
Rather, we want to model the intrinsic dimensionality of the data, otherwise we get the problem of overfitting
Analogous to the problem of overtraining, where a model is trained for too long, models the data too exactly and loses its generality
[Figure: generalisation performance vs model complexity. As the model complexity grows, performance improves for a while but starts to degrade after reaching an optimal level.]
-
Similar problems occur in classification: a model with too much flexibility does not generalise well, resulting in a non-smooth decision boundary
Somewhat like giving a system enough capacity to remember all training points: no need to generalise. Less memory => it must generalise to be able to model the training data
Trade-off between being a good fit to the training data and achieving good generalisation: cf. the Bias-Variance trade-off (later)