nns2_basics.ppt


TRANSCRIPT

• Slide 1/28

Lecture 2: Basics and definitions

Networks as Data Models

• Slide 2/28

Last lecture: an artificial neuron

$y_i = f\left(\sum_{j=1}^{m} w_{ij} x_j + b_i\right)$   (1)

[Figure: an artificial neuron with inputs x_1, x_2, ..., x_m, weights w_11, w_12, ..., w_1m, bias b_1 (the bias can be treated as a weight on a fixed extra input x_0 = 1), and outputs y_1, ..., y_i]

• Slide 3/28

Thus the artificial neuron is defined by the components:

1. A set of inputs, x_i.
2. A set of weights, w_ij.
3. A bias, b_i.
4. An activation function, f.
5. The neuron output, y.

The subscript i indicates the i-th input or weight.

As the inputs and output are external, the parameters of this model are therefore the weights, the bias and the activation function, and these thus DEFINE the model.

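As a concrete illustration (not from the slides), here is a minimal Python sketch of the neuron in equation (1); the tanh activation and the example values are assumptions:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Single artificial neuron: y = f(sum_j w_j * x_j + b), cf. equation (1)."""
    return f(np.dot(w, x) + b)

# Example with m = 3 inputs (arbitrary values)
x = np.array([0.5, -1.0, 2.0])   # inputs x_1 .. x_m
w = np.array([0.1, 0.4, -0.3])   # weights w_11 .. w_1m
b = 0.2                          # bias b_1
print(neuron(x, w, b))
```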
• Slide 4/28

Hidden layers

[Figure: a multi-layer network from inputs x_1, x_2, ..., x_n (visual input), through hidden layers, to the output (motor output)]

More layers means more of the same parameters (and several subscripts).

• Slide 5/28

Network as a data model

We can view a network as a model which has a set of parameters associated with it.

Networks transform input data into an output. The transformation is defined by the network parameters.

The parameters are set/adapted by an optimisation/adaptive procedure: learning (Haykin, 99).

The idea is that, given a set of data points, the network (model) can be trained so as to generalise.

• Slide 6/28

NNs for function approximation

That is, the network learns a (correct) mapping from inputs to outputs.

Thus NNs can be seen as a multivariate non-linear mapping and are often used for function approximation.

There are 2 main categories:

Classification: given an input, say which class it is in.

Regression: given an input, what is the expected output?

• Slide 7/28

LEARNING: extracting principles from data

The mapping/function needs to be learnt: various methods are available.

The learning process used shapes the final solution.

Supervised learning: have a teacher, telling you where to go.

Unsupervised learning: no teacher, the net learns by itself.

Reinforcement learning: have a critic, wrong or correct.

The type of learning used depends on the task at hand. We will deal mainly with supervised and unsupervised learning. Reinforcement learning will be taught in the Adaptive Systems course, or can be found in e.g. Haykin, Hertz et al., or Sutton, R.S., and Barto, A.G. (1998): Reinforcement Learning: An Introduction, MIT Press.

• Slide 8/28

Pattern recognition

Pattern: the opposite of chaos; it is an entity, vaguely defined, that could be given a name or a classification.

Examples:

Fingerprints,
Handwritten characters,
Human faces,
Speech (or deer/whale/bat etc.) signals,
Iris patterns,
Medical imaging (various screening procedures),
Remote sensing, etc.

• Slide 9/28

Given a pattern, there are two options:

a. supervised classification (discriminant analysis), in which the input pattern is identified as a member of a predefined class;

b. unsupervised classification (e.g. clustering), in which the pattern is assigned to a hitherto unknown class.

Unsupervised methods will be discussed further in future lectures.

• Slide 10/28

E.g. handwritten digit classification: we first need a data set to learn from: sets of characters.

How are they represented? E.g. as an input vector x = (x_1, ..., x_n) to the network (e.g. a vector of ones and zeroes, one per pixel, according to whether the pixel is black/white).

The set of input vectors is our Training Set X, which has already been classified into 'a's and 'b's (note: capitals for the set, X; underlined small letters for an instance of the set, x_i, i.e. the i-th training pattern/vector).

Given a training set X, our goal is to tell if a new image is an 'a' or a 'b', i.e. to classify it into one of 2 classes, C1 (all 'a's) or C2 (all 'b's) (in general, one of k classes C1, ..., Ck).

[Figure: sample images of the letters 'a' and 'b']
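A minimal sketch of the pixel representation described above (the tiny 3 x 3 image is a made-up example):

```python
import numpy as np

# A made-up binary image: 1 = black pixel, 0 = white
image = np.array([[0, 1, 0],
                  [1, 1, 0],
                  [0, 1, 1]])

x = image.flatten()   # input vector x = (x_1, ..., x_n), here n = 9
```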

• Slide 11/28

Generalisation

Q. How do we tell if a new, unseen image is an 'a' or a 'b'?
A. Brute force: have a library of all possible images.

But 256 x 256 pixels => 2^(256 x 256) ≈ 10^19,728 images.

Impossible! We typically have less than a few thousand images in the training set.

Therefore, the system must be able to classify UNSEEN patterns from the patterns it has seen.

I.e. it must be able to generalise from the data in the training set.

Intuition: real neural networks do this well, so maybe artificial ones can do the same. As they are also shaped by experiences, maybe we'll also learn about how the brain does it ...

• Slide 12/28

For 2-class classification we want the network output y (a function of the inputs and the network parameters) to be:

y(x, w) = 1 if x is an 'a'
y(x, w) = -1 if x is a 'b'

where x is an input vector and the network parameters are grouped as a vector w.

y is known as a discriminant function: it discriminates between the 2 classes.
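A minimal sketch of such a discriminant function, assuming (as an illustration, not from the slides) a linear combination of the inputs followed by a sign threshold:

```python
import numpy as np

def discriminant(x, w, b):
    """Return +1 (class 'a') or -1 (class 'b') from a thresholded linear combination."""
    return 1 if np.dot(w, x) + b >= 0 else -1
```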

• Slide 13/28

As the network mapping is defined by the parameters, we must use the data set to perform Learning (training, adaptation), i.e.:

change the weights or the interaction between neurons according to the training examples (and possibly prior knowledge of the problem),

where the purpose of learning is to minimise:

training errors on the learning data: the learning error;

prediction errors on new, unseen data: the generalisation error;

since, when the errors are minimised, the network discriminates between the 2 classes.

We therefore need an error function to measure the network performance based on the training error. An optimisation algorithm can then be used to minimise the learning errors and train the network.
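The slides do not fix a particular error function; one common choice (an assumption here) is the sum-of-squares error over the training set:

```python
def training_error(model, X, targets):
    """Sum-of-squares error: 0.5 * sum over patterns of (output - target)^2."""
    return 0.5 * sum((model(x) - t) ** 2 for x, t in zip(X, targets))
```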

• Slide 14/28

Feature Extraction

However, if we use all the pixels as inputs we are going to have a long training procedure and a very big network.

We may want to analyse the data first (pre-process it) and extract some (lower-dimensional) salient features to be the inputs to the network.

[Figure: feature extraction maps x in pattern space (the data) to x* in feature space]

We could use the ratio of the height and width of a letter as the feature, since 'b's will tend to be higher than 'a's (prior knowledge). This feature is also scale invariant.
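A minimal sketch of such a height/width feature (computing it from the bounding box of the black pixels is an assumption for illustration):

```python
import numpy as np

def aspect_ratio(image):
    """Feature x*: height/width ratio of the bounding box of the black pixels."""
    rows, cols = np.nonzero(image)
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return height / width
```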

• Slide 15/28

We could then make a decision based on this feature. Suppose we make a histogram of the values of x* for the input vectors in the training set X.

[Figure: histograms of x* for classes C1 and C2, with a new input at x* = A]

For a new input with an x* value of A, we would classify it as C1, as it is more likely to belong to this class.

• Slide 16/28

Therefore, we get the idea of a Decision Boundary.

Points on one side of the boundary are in one class, and points on the other side are in the other class, i.e.:

if x* < d the pattern is in C1, else it is in C2.

Intuitively it makes sense (and is optimal in a Bayesian sense) to place the boundary where the 2 histograms cross.

[Figure: the class histograms of x* with the decision boundary at x* = d]

• Slide 17/28

We can then view pattern recognition as the process of assigning patterns to one of a number of classes by dividing up the feature space with decision boundaries, which thus divides the original space.

[Figure: pattern space (data) -> feature extraction -> feature space -> classification -> decision space]

• Slide 18/28

However, there can be lots of overlap. In this case we could use a rejection threshold e, where:

if x* < d - e the pattern is in C1;
if x* > d + e the pattern is in C2;
else refer it to a better/different classifier.

This is related to the idea of minimising Risk, where it may be more important not to misclassify in one class rather than the other. This is especially important in medical applications. It can serve to shift the decision boundary one way or the other based on the Loss function, which defines the relative importance/cost of the different errors.

[Figure: the overlapping class histograms, with an ambiguous input near the boundary marked 'A?']
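A minimal sketch of this rejection rule (the function name is illustrative):

```python
def classify_with_rejection(x_star, d, e):
    """Threshold classifier with a rejection band of width 2e around d."""
    if x_star < d - e:
        return "C1"
    if x_star > d + e:
        return "C2"
    return "reject"   # refer to a better/different classifier
```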

• Slide 19/28

Alternatively, we can use more features.

[Figure: a scatter plot of the two classes in the (x_1*, x_2*) feature plane]

Here, the use of any one feature leads to significant overlap (imagine projections onto the axes), but the use of both gives a good separation.

However, we cannot keep increasing the number of features, as there will come a point where the performance starts to degrade because there is not enough data to provide a good estimate (cf. using 256 x 256 pixels).

• Slide 20/28

Curse of dimensionality

Geometric e.g.: suppose we want to approximate a 1-d function y from m-dimensional training data. We could:

divide each dimension into intervals (like a histogram);

set the y value for an interval to the mean y value of all points in the interval;

increase the precision by increasing the number of intervals.

However, we need at least 1 point in each interval, so for k intervals in each dimension we need > k^m data points.

Thus the number of data points grows at least exponentially with the input dimension.

This is known as the Curse of Dimensionality:

'A function defined in high dimensional space is likely to be much more complex than a function defined in a lower dimensional space and those complications are harder to discern' (Friedman 95, in Haykin, 99).
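A quick sketch of the k^m growth (k = 10 intervals is an arbitrary choice):

```python
k = 10                      # intervals per dimension
for m in (1, 2, 3, 5, 10):  # input dimension
    print(f"m = {m:2d}: need > {k ** m} data points")
```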

• Slide 21/28

Of course, the above is a particularly inefficient way of using data, and most NNs are less susceptible.

However, the only practical way to beat the curse is to incorporate correct prior knowledge.

In practice, we must make the underlying function smoother (i.e. less complex) with increasing input dimensionality.

We can also try to reduce the input dimension by pre-processing.

Mainly, we learn to live with the fact that perfect performance is not possible: data in the real world sometimes overlaps. We treat the input data as random variables and instead look for a model which has the smallest probability of making a mistake.

• Slide 22/28

Multivariate regression

This is a type of function approximation: we try to approximate a function from a set of (noisy) training data. E.g. suppose we have the function:

y = 0.5 + 0.4 sin(2πx).

We generate training data at equal intervals of x and add a little random Gaussian noise with s.d. 0.05.

We add noise since, in practical applications, data will inevitably be noisy.

We then test the model by plugging in many values of x and viewing the resultant function. This gives an idea of the Generalisation performance of the model.
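A minimal sketch of generating such a training set (the 25 points and the fixed seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)                 # equally spaced inputs
y = 0.5 + 0.4 * np.sin(2 * np.pi * x)         # underlying function
t = y + rng.normal(0.0, 0.05, size=x.shape)   # noisy targets, s.d. 0.05
```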

• Slide 23/28

E.g. suppose we have the function y = 0.5 + 0.4 sin(2πx). We generate training data at equal intervals of x (red circles), add a little random Gaussian noise with s.d. 0.05, and the model is trained on this data.

[Figure: the noisy training points (red circles) sampled from the sine function]

• Slide 24/28

We then test the model (in this case a piecewise linear model) by plugging in many values of x and viewing the resultant function (solid blue line). This gives an idea of the Generalisation performance of the model.

[Figure: the piecewise linear fit (solid blue line) through the noisy training points]

• Slide 25/28

Model Complexity

In the previous picture we used a piecewise linear function to approximate the data. It is better to use a polynomial, $y = \sum_i a_i x^i$, to approximate the data, i.e.:

$y = a_0 + a_1 x$ : 1st order (straight line)
$y = a_0 + a_1 x + a_2 x^2$ : 2nd order (quadratic)
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$ : 3rd order
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \dots + a_n x^n$ : nth order

As the order (the highest power of x) increases, so does the potential complexity of the model/polynomial.

This means that it can represent a more complex (non-smooth) function and thus approximate the data more accurately.
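A minimal sketch of fitting polynomials of increasing order to noisy data like that above, using numpy's least-squares polynomial fit (the orders shown match the next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
t = 0.5 + 0.4 * np.sin(2 * np.pi * x) + rng.normal(0.0, 0.05, size=x.shape)

for order in (1, 3, 10):
    coeffs = np.polyfit(x, t, deg=order)      # least-squares polynomial fit
    model = np.poly1d(coeffs)
    err = 0.5 * np.sum((model(x) - t) ** 2)   # training (sum-of-squares) error
    print(f"order {order:2d}: training error = {err:.4f}")
```

The training error falls as the order rises, which is exactly the overfitting behaviour discussed on the following slides.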

• Slide 26/28

[Figure: polynomial fits of different orders to the noisy data]

1st order: the model is too simple.

10th order: more accurate in terms of passing through the data points, but too complex and non-smooth (curvy).

3rd order: models the underlying function well.

• Slide 27/28

Note, though, that the training error continues to go down as the model matches the fine-scale detail of the data (i.e. the noise).

Rather, we want to model the intrinsic dimensionality of the data, otherwise we get the problem of overfitting.

This is analogous to the problem of overtraining, where a model is trained for too long, models the data too exactly, and loses its generality.

As the model complexity grows, performance improves for a while but starts to degrade after reaching an optimal level.

• Slide 28/28

Similar problems occur in classification: a model with too much flexibility does not generalise well, resulting in a non-smooth decision boundary.

It is somewhat like giving a system enough capacity to remember all the training points: there is no need to generalise. With less memory, it must generalise to be able to model the training data.

There is a trade-off between being a good fit to the training data and achieving a good generalisation: cf. the Bias-Variance trade-off (later).