
    Lecture Notes on Pattern Recognition and Image Processing

    Jonathan G. Campbell

    Department of Computing,

    Letterkenny Institute of Technology,

    Co. Donegal, Ireland.

    email: jonathan.campbell at gmail.com, [email protected]

    URL:http://www.jgcampbell.com/

    Report No: jc/05/0005/r

    Revision 1.1 (minor typos corrected, and Fig. 4.3 added).

    13th August 2005


Contents

1 Introduction
2 Simple classifier algorithms
   2.1 Thresholding for one-dimensional data
   2.2 Linear separating lines/planes for two-dimensions
   2.3 Nearest mean classifier
   2.4 Normal form of the separating line, projections, and linear discriminants
   2.5 Projection and linear discriminant
   2.6 Projections and linear discriminants in p dimensions
   2.7 Template Matching and Discriminants
   2.8 Nearest neighbour methods
3 Statistical methods
   3.1 Bayes' Rule for the Inversion of Conditional Probabilities
   3.2 Parametric Methods
   3.3 Discriminants based on Gaussian Density
   3.4 Bayes-Gauss Classifier Special Cases
      3.4.1 Equal and Diagonal Covariances
      3.4.2 Equal but General Covariances
   3.5 Least square error trained classifier
   3.6 Generalised linear discriminant function
4 Neural networks
   4.1 Neurons for Boolean Functions
   4.2 Three-layer neural network for arbitrarily complex decision regions
   4.3 Sigmoid activation functions
A Principal Components Analysis and Fisher's Linear Discriminant Analysis
   A.1 Principal Components Analysis
   A.2 Fisher's Linear Discriminant Analysis


    Chapter 1

    Introduction

    This note is a reworking of some of the basic pattern recognition and neural network material covered in (Campbell &

    Murtagh 1998); it includes some material from (Campbell 2000). Chapter 3 contains more detail than other chapters; except

    for the first page, this could be skipped on first reading.

    We define/summarize a pattern recognition system using the block diagram in Figure 1.1.

Figure 1.1: Pattern recognition system (classifier); the input is x, a tuple of p sensed measurements, and the output is the class label ω (omega).

In some cases, this might work; i.e. we collect the pixel values of a sub-image that contains the object of interest, place them in a p-tuple x, and leave everything to the classifier algorithm. However, Figure 1.2 shows a more realistic setup. In this arrangement, still somewhat idealised, we assume, for example, that we have somehow isolated the object in a (sub)image: we segment the image into object pixels (value 1, say) and background pixels (value 0).

In such a setup we can do all the problem-specific processing in the first two stages, and pass the feature vector (in general p-dimensional) to a general purpose classifier.

Figure 1.2: Image pattern recognition problem separated into stages: raw image → segment → binary image → feature extractor → feature tuple (*) → classifier → class label (++). (*) For example, in the trivial case of identifying 0 and 1, we could have a feature vector (x1 = width, x2 = ink area). (++) class = 0, 1; in a true OCR problem (A-Z, 0-9), class = 1..36.

Pattern recognition (classification) may be posed as an inference problem. The inference involves class labels, that is, we have a set of examples (training data), XT = {xi, ωi}_{i=1}^{n}. x is the pattern vector; of course, we freely admit that in certain situations x is a simple scalar. ω is the class label, ω ∈ Ω = {ω1, . . . , ωc}; given an unseen pattern x, we infer ω. In general, x = (x0 x1 . . . x_{p−1})^T, a p-dimensional vector; ^T denotes transposition.

    From now on we concentrate on classifier algorithms, and we refer only occasionally to concrete examples.


    Chapter 2

    Simple classifier algorithms

    2.1 Thresholding for one-dimensional data

In our simplistic character recognition system we are required to recognise two characters, 0 and 1. We extract two features: width and inked area. These comprise the feature vector, x = (x1 x2)^T.

Figure 2.1: Width data, x1, for the two classes, with threshold T; the class 1 samples lie to the left of T, the class 0 samples to the right.

Let us see whether we can recognise using width alone (x1). Figure 2.1 shows some (training) data that have been collected. We see that a threshold (T) set at about x1 = 2.8 is the best we can do; the classification algorithm is:

ω = 1 when x1 ≤ T,    (2.1)
ω = 0 otherwise.    (2.2)

Use of histograms, see Figure 2.2, might be a more methodical way of determining the threshold, T.

If enough training data were available, n → ∞, the histograms, h0(x1) and h1(x1), properly normalised, would approach probability densities: p0(x1), p1(x1), more properly called class conditional probability densities: p(x1 | ω), ω = 0, 1, see Figure 2.3.

When the feature vector is three-dimensional (p = 3) or more, it becomes impossible to estimate the probability densities using histogram binning: there are a great many bins, and most of them contain no data.
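To make the threshold rule of eqns 2.1-2.2 and the histogram idea concrete, here is a minimal sketch in Python/NumPy; it is not part of the original notes, and the width samples are invented for illustration (only the threshold T = 2.8 comes from the text).

```python
import numpy as np

# Hypothetical width samples (x1) for the two character classes, loosely
# in the spirit of Figure 2.1; the real training data are not given.
x1_class1 = np.array([1.2, 1.5, 1.8, 2.0, 2.3, 2.6])
x1_class0 = np.array([3.0, 3.3, 3.6, 3.9, 4.2, 4.5, 4.8, 5.1, 5.5, 6.0])

T = 2.8  # threshold chosen by inspection, as in the text

def classify_width(x1, threshold=T):
    """Eqns 2.1-2.2: class 1 if x1 <= T, class 0 otherwise."""
    return 1 if x1 <= threshold else 0

# Histogram estimates of the class conditional densities p(x1 | omega);
# with density=True each histogram integrates to one (cf. Figures 2.2, 2.3).
bins = np.linspace(1.0, 6.0, 11)
h1, _ = np.histogram(x1_class1, bins=bins, density=True)
h0, _ = np.histogram(x1_class0, bins=bins, density=True)

print(classify_width(2.1), classify_width(4.0))   # -> 1 0
```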


Figure 2.2: Width data, x1, with histograms h0(x1) and h1(x1).

Figure 2.3: Class conditional densities p(x1 | 0) and p(x1 | 1).


    2.2 Linear separating lines/planes for two-dimensions

Since there is overlap in the width, x1, measurement, let us use the two features, x = (x1 x2)^T, i.e. (width, area). Figure 2.4 shows a scatter plot of these data.

Figure 2.4: Two dimensions, scatter plot of the class 0 and class 1 data in the (x1, x2) plane.

    The dotted line shows that the data are separable by a straight line; it intercepts the axes at x1 = 4.5 and x2 = 6.

    Apart from plotting the data and drawing the line, how could we derive it from the data? (Thinking of a computer program.)

    2.3 Nearest mean classifier

    Figure 2.5 shows the line joining the class means and the perpendicular bisector of this line; the perpendicular bisector turns

out to be the separating line. We can derive the equation of the separating line using the fact that points on it are equidistant to both means, μ0 and μ1, and expand using Pythagoras's theorem,

|x − μ0|² = |x − μ1|²,    (2.3)

(x1 − μ01)² + (x2 − μ02)² = (x1 − μ11)² + (x2 − μ12)².    (2.4)

We eventually obtain

2(μ01 − μ11)x1 + 2(μ02 − μ12)x2 − (μ01² + μ02² − μ11² − μ12²) = 0,    (2.5)

which is of the form

b1x1 + b2x2 − b0 = 0.    (2.6)

In Figure 2.5, μ01 = 4, μ02 = 3, μ11 = 2, μ12 = 1.5; with these values, eqn 2.6 becomes

4x1 + 3x2 − 18.75 = 0,    (2.7)

which intercepts the x1 axis at 18.75/4 ≈ 4.7 and the x2 axis at 18.75/3 = 6.25.
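The construction above translates directly into code. The following minimal sketch (Python/NumPy, not part of the original notes) uses the means of Figure 2.5 to apply the nearest-mean rule and to compute the separating-line coefficients of eqns 2.5-2.6.

```python
import numpy as np

# Class means from Figure 2.5: mu0 = (4, 3), mu1 = (2, 1.5).
mu0 = np.array([4.0, 3.0])
mu1 = np.array([2.0, 1.5])

def nearest_mean_class(x):
    """Assign x to the class whose mean is closer (eqn 2.3)."""
    d0 = np.sum((x - mu0) ** 2)
    d1 = np.sum((x - mu1) ** 2)
    return 0 if d0 < d1 else 1

# Coefficients of the separating line b1 x1 + b2 x2 - b0 = 0 (eqns 2.5-2.6),
# the perpendicular bisector of the line joining the means.
b = 2.0 * (mu0 - mu1)                       # (b1, b2) = (4, 3)
b0 = np.dot(mu0, mu0) - np.dot(mu1, mu1)    # 18.75, as in eqn 2.7

print(b, b0)                                     # -> [4. 3.] 18.75
print(nearest_mean_class(np.array([5.0, 4.0])))  # nearer mu0 -> 0
```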


Figure 2.5: Two dimensional scatter plot showing means and separating line.

    2.4 Normal form of the separating line, projections, and linear discriminants

Eqn 2.6 becomes more interesting and useful in its normal form,

a1x1 + a2x2 − a0 = 0,    (2.8)

where a1² + a2² = 1; eqn 2.8 can be obtained from eqn 2.6 by dividing across by √(b1² + b2²).

Figure 2.6 shows interpretations of the normal form straight line equation, eqn 2.8. The coefficients of the unit vector normal to the line are n = (a1 a2)^T, and a0 is the perpendicular distance from the line to the origin. Incidentally, the components correspond to the direction cosines of n: n = (a1 a2)^T = (cos θ  sin θ)^T. Thus (Foley, van Dam, Feiner, Hughes & Phillips 1994), n corresponds to one row of a (frame) rotating matrix; in other words, see below, section 2.5, the dot product of the vector expression of a point with n corresponds to projection onto n. (Note that cos(π/2 − θ) = sin θ.)

Also as shown in Figure 2.6, points x = (x1 x2)^T on the side of the line to which n = (a1 a2)^T points have a1x1 + a2x2 − a0 > 0, whilst points on the other side have a1x1 + a2x2 − a0 < 0; as we know, points on the line have a1x1 + a2x2 − a0 = 0.

    2.5 Projection and linear discriminant

We know that a1x1 + a2x2 = a^T x, the dot product of n = (a1 a2)^T and x; it represents the projection of points x onto n, yielding the scalar value along n, with a0 fixing the origin. This is plausible: projecting onto n yields optimum separability.

Such a projection,

g(x) = a1x1 + a2x2,    (2.9)

is called a linear discriminant; now we can adapt eqn. 2.2,

ω = 0 when g(x) > a0,    (2.10)
ω = 1 when g(x) < a0,    (2.11)
ω = tie when g(x) = a0.    (2.12)

Linear discriminants, eqn. 2.12, are often written as


Figure 2.6: Normal form of a straight line, interpretations: the unit normal vector (a1, a2) at angle θ to the x1 axis, the perpendicular distance a0 from the line to the origin, the axis intercepts a0/a1 and a0/a2, and the sign of a1x1 + a2x2 − a0 on either side of the line.

g(x) = a1x1 + a2x2 − a0,    (2.13)

whence eqn. 2.12 becomes

ω = 0 when g(x) > 0,    (2.14)
ω = 1 when g(x) < 0,    (2.15)
ω = tie when g(x) = 0.    (2.16)
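As a small worked example of eqns 2.8 and 2.13-2.16, the sketch below (Python/NumPy, not part of the original notes) converts the line of eqn 2.7 to its normal form and classifies points by the sign of g(x).

```python
import numpy as np

# Separating line from eqn 2.7: 4 x1 + 3 x2 - 18.75 = 0.
b = np.array([4.0, 3.0])
b0 = 18.75

# Normal form (eqn 2.8): divide by sqrt(b1^2 + b2^2) so that a1^2 + a2^2 = 1.
norm = np.linalg.norm(b)        # 5.0
a = b / norm                    # unit normal n = (0.8, 0.6)
a0 = b0 / norm                  # 3.75, perpendicular distance to the origin

def g(x):
    """Linear discriminant, eqn 2.13: projection onto n, minus a0."""
    return np.dot(a, x) - a0

def classify(x):
    """Eqns 2.14-2.16: the sign of g(x) decides the class."""
    val = g(x)
    return 0 if val > 0 else (1 if val < 0 else "tie")

print(classify(np.array([5.0, 4.0])), classify(np.array([2.0, 1.5])))  # -> 0 1
```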

    2.6 Projections and linear discriminants in p dimensions

Equation 2.13 readily generalises to p dimensions: n is a unit vector in p-dimensional space, normal to the (p − 1)-dimensional separating hyperplane. For example, when p = 3, n is the unit vector normal to the separating plane.

Other important projections used in pattern recognition are Principal Components Analysis (PCA), see section A.1, and Fisher's Linear Discriminant, see section A.2.

    2.7 Template Matching and Discriminants

An intuitive (but well founded) classification method is that of template matching or correlation matching. Here we have perfect or average examples of the classes stored in vectors {zj}, j = 1, . . . , c, one for each class. Without loss of generality, we assume that all vectors are normalised to unit length.

Classification of a newly arrived vector x entails computing its template/correlation match with all c templates:

x^T zj;    (2.17)

the class is chosen as the ωj corresponding to the maximum of eqn. 2.17.

    Yet again we see that classification involves dot product, projection, and a linear discriminant.
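A minimal sketch of template matching (Python/NumPy, not part of the original notes; the templates are invented and merely illustrate the argmax over dot products of eqn 2.17):

```python
import numpy as np

def template_match(x, templates):
    """Classify x by correlation with unit-length class templates z_j (eqn 2.17).

    templates: array of shape (c, p), one normalised template per class.
    Returns the index j of the template giving the maximum dot product x^T z_j.
    """
    scores = templates @ x          # the c dot products
    return int(np.argmax(scores))

# Hypothetical class prototypes, normalised to unit length.
z = np.array([[3.0, 1.0],
              [1.0, 3.0]])
z = z / np.linalg.norm(z, axis=1, keepdims=True)

print(template_match(np.array([5.0, 4.0]), z))   # -> 0
```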


    2.8 Nearest neighbour methods

Obviously, we may not always have the linear separability of Figure 2.5. One non-parametric method is to go beyond nearest mean, see eqn. 2.4, to compute the nearest neighbour in the entire training data set, and to decide class according to the class of the nearest neighbour.

    A variation is k nearest neighbour, where a vote is taken over the classes of k nearest neighbours.
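A minimal sketch of the k nearest neighbour rule (Python/NumPy, not part of the original notes; the training data are invented; k = 1 gives the plain nearest neighbour rule):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """k nearest neighbour classifier: vote over the classes of the k
    training vectors closest (in Euclidean distance) to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical training data in the (width, area) plane of Figure 2.4.
X_train = np.array([[4.0, 3.2], [4.5, 3.8], [5.0, 4.1],   # class 0
                    [1.8, 1.4], [2.2, 1.6], [2.5, 2.0]])  # class 1
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(np.array([2.0, 1.5]), X_train, y_train, k=3))   # -> 1
```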


    Chapter 3

    Statistical methods

    Recall Figure 2.3, repeated here as Figure 3.1.

Figure 3.1: Class conditional densities p(x1 | 0) and p(x1 | 1).

We have class conditional probability densities: p(x1 | ω), ω = 0, 1; given a newly arrived x1, we might decide on its class according to the maximum class conditional probability density at x1, i.e. set a threshold T where p(x1 | 0) and p(x1 | 1) cross, see Figure 3.1.

    This is not completely correct. What we want is the probability of each class (its posterior probability) based on the evidence

    supplied by the data, combined with any prior evidence.

In what follows, P(ω | x) is the posterior probability or a posteriori probability of class ω given the observation x; P(ω) is the prior probability or a priori probability. We use upper case P(.) for discrete probabilities, whilst lower case p(.) denotes probability densities.

    Let the Bayes decision rule be represented by a function g(.) of the feature vector x:

g(x) = arg max_{ωj} [P(ωj | x)]    (3.1)

To show that the Bayes decision rule, eqn. 3.1, achieves the minimum probability of error, we compute the probability of error conditional on the feature vector x, the conditional risk associated with it:

R(g(x) = ωj | x) = Σ_{k=1, k≠j}^{c} P(ωk | x).    (3.2)


That is to say, for the point x we compute the posterior probabilities of all the c − 1 classes not chosen.

Since Ω = {ω1, . . . , ωc} form an event space (they are mutually exclusive and exhaustive) and the P(ωk | x), k = 1, . . . , c, are probabilities and so sum to unity, eqn. 3.2 reduces to:

R(g(x) = ωj) = 1 − P(ωj | x).    (3.3)

It immediately follows that, to minimise R(g(x) = ωj), we maximise P(ωj | x), thus establishing the optimality of eqn. 3.1.

The problem now is to determine P(ω | x), which brings us to Bayes' rule.

3.1 Bayes' Rule for the Inversion of Conditional Probabilities

    From the definition of conditional probability, we have:

p(ω, x) = P(ω | x) p(x),    (3.4)

and, owing to the fact that the events in a joint probability are interchangeable, we can equate the joint probabilities:

p(ω, x) = p(x, ω) = p(x | ω) P(ω).    (3.5)

Therefore, equating the right hand sides of these equations, and rearranging, we arrive at Bayes' rule for the posterior probability P(ω | x):

P(ω | x) = p(x | ω) P(ω) / p(x).    (3.6)

P(ω) expresses our belief that ω will occur, prior to any observation. If we have no prior knowledge, we can assume equal priors for each class: P(ω1) = P(ω2) = . . . = P(ωc), with Σ_{j=1}^{c} P(ωj) = 1. Although we avoid further discussion here, we note that the matter of choice of prior probabilities is the subject of considerable discussion, especially in the literature on Bayesian inference, see, for example, (Sivia 1996).

p(x) is the unconditional probability density of x, and can be obtained by summing the conditional densities:

p(x) = Σ_{j=1}^{c} p(x | ωj) P(ωj).    (3.7)

    Thus, to solve eqn. 3.6, it remains to estimate the conditional densities.
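Once the class-conditional densities have been evaluated at the observed x, the inversion of eqns 3.6-3.7 is a one-line computation. A minimal sketch (Python/NumPy, not part of the original notes; the numerical values are invented):

```python
import numpy as np

def posterior(likelihoods, priors):
    """Bayes' rule, eqns 3.6-3.7: P(omega_j | x) from the class conditional
    densities p(x | omega_j) evaluated at x, and the priors P(omega_j)."""
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = likelihoods * priors            # p(x | omega_j) P(omega_j)
    return joint / joint.sum()              # divide by p(x), eqn 3.7

# Hypothetical densities p(x1 | omega) at some observed x1, with equal priors.
print(posterior([0.30, 0.10], [0.5, 0.5]))   # -> [0.75 0.25]
```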

    3.2 Parametric Methods

    Where we can assume that the densities follow a particular form, for example Gaussian, the density estimation problem is

    reduced to that of estimation of parameters.

The multivariate Gaussian or Normal density, p-dimensional, is given by:

p(x | ωj) = (1 / ((2π)^{p/2} |Kj|^{1/2})) exp[ −(1/2)(x − μj)^T Kj⁻¹ (x − μj) ]    (3.8)

p(x | ωj) is completely specified by μj, the p-dimensional mean vector, and Kj, the corresponding p × p covariance matrix:

μj = E[x]_{ω=ωj},    (3.9)

Kj = E[(x − μj)(x − μj)^T]_{ω=ωj}.    (3.10)

The respective maximum likelihood estimates are:

μj = (1/Nj) Σ_{n=1}^{Nj} xn,    (3.11)

and,

Kj = (1/(Nj − 1)) Σ_{n=1}^{Nj} (xn − μj)(xn − μj)^T,    (3.12)

where we have separated the training data XT = {xn, ωn}_{n=1}^{N} into sets according to class.
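In code, the estimates of eqns 3.11-3.12 are simply a per-class mean and covariance. A minimal sketch (Python/NumPy, not part of the original notes; the data are invented), using np.cov with ddof=1 to match the Nj − 1 divisor:

```python
import numpy as np

def estimate_class_parameters(X, y):
    """Estimate the mean vector and covariance matrix of each class
    (eqns 3.11-3.12) from training data X (N x p) with labels y."""
    params = {}
    for j in np.unique(y):
        Xj = X[y == j]
        mu_j = Xj.mean(axis=0)                    # eqn 3.11
        K_j = np.cov(Xj, rowvar=False, ddof=1)    # eqn 3.12 (divides by Nj - 1)
        params[j] = (mu_j, K_j)
    return params

# Hypothetical two-class training set in two dimensions.
X = np.array([[4.0, 3.2], [4.4, 2.9], [3.8, 3.1],
              [2.1, 1.4], [1.9, 1.6], [2.3, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1])
for j, (mu, K) in estimate_class_parameters(X, y).items():
    print(j, mu, K.shape)
```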


    3.3 Discriminants based on Gaussian Density

We may write eqn. 3.6 as a discriminant function:

gj(x) = P(ωj | x) = p(x | ωj) P(ωj) / p(x),    (3.13)

so that classification, eqn. 3.1, becomes a matter of assigning x to class ωj if,

gj(x) > gk(x), ∀ k ≠ j.    (3.14)

Since p(x), the denominator of eqn. 3.13, is the same for all gj(x), and since eqn. 3.14 involves comparison only, we may rewrite eqn. 3.13 as

gj(x) = p(x | ωj) P(ωj).    (3.15)

We may derive a further possible discriminant by taking the logarithm of eqn. 3.15; since the logarithm is a monotonically increasing function, application of it preserves the relative order of its arguments:

gj(x) = log p(x | ωj) + log P(ωj).    (3.16)

In the multivariate Gaussian case, eqn. 3.16 becomes (Duda & Hart 1973),

gj(x) = −(1/2)(x − μj)^T Kj⁻¹ (x − μj) − (p/2) log 2π − (1/2) log|Kj| + log P(ωj)    (3.17)

    Henceforth, we refer to eqn. 3.17 as the Bayes-Gauss classifier.
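A minimal sketch of the Bayes-Gauss classifier of eqns 3.14 and 3.17 (Python/NumPy, not part of the original notes; the class parameters are invented):

```python
import numpy as np

def bayes_gauss_discriminant(x, mu, K, prior):
    """The discriminant g_j(x) of eqn. 3.17 for one class."""
    p = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(K) @ diff
            - 0.5 * p * np.log(2.0 * np.pi)
            - 0.5 * np.log(np.linalg.det(K))
            + np.log(prior))

def bayes_gauss_classify(x, params, priors):
    """Assign x to the class with the largest discriminant (eqn. 3.14)."""
    scores = {j: bayes_gauss_discriminant(x, mu, K, priors[j])
              for j, (mu, K) in params.items()}
    return max(scores, key=scores.get)

# Hypothetical parameters for two classes with equal priors.
params = {0: (np.array([4.0, 3.0]), np.eye(2)),
          1: (np.array([2.0, 1.5]), np.eye(2))}
print(bayes_gauss_classify(np.array([3.8, 2.9]), params, {0: 0.5, 1: 0.5}))   # -> 0
```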

The multivariate Gaussian density provides a good characterisation of pattern (vector) distribution where we can model the generation of patterns as ideal pattern plus measurement noise; for an instance of a measured vector x from class j:

xn = μj + en,    (3.18)

where en ∼ N(0, Kj), that is, the noise covariance is class dependent.

    3.4 Bayes-Gauss Classifier Special Cases

(Duda & Hart 1973, pp. 26-31)

Revealing comparisons with the other learning paradigms covered in these notes are made possible if we examine particular forms of noise covariance in which the Bayes-Gauss classifier decays to certain interesting limiting forms:

Equal and Diagonal Covariances (Kj = σ²I, ∀j, where I is the unit matrix); in this case certain important equivalences with eqn. 3.17 can be demonstrated:

- Nearest mean classifier;
- Linear discriminant;
- Template matching;
- Matched filter;
- Single layer neural network classifier.

Equal but Non-diagonal Covariance Matrices:

- Nearest mean classifier using Mahalanobis distance;

and, as in the case of diagonal covariance,

- Linear discriminant function;
- Single layer neural network.


    3.4.1 Equal and Diagonal Covariances

When each class has the same covariance matrix, and these are diagonal, we have Kj = σ²I, so that Kj⁻¹ = (1/σ²)I. Since the covariance matrices are equal, we can eliminate the (1/2) log|Kj| term; the (p/2) log 2π term is constant in any case; thus, including the simplification of the (x − μj)^T Kj⁻¹ (x − μj) term, eqn. 3.17 may be rewritten:

gj(x) = −(1/(2σ²)) (x − μj)^T (x − μj) + log P(ωj)    (3.19)
      = −(1/(2σ²)) |x − μj|² + log P(ωj).    (3.20)

Nearest mean classifier If we assume equal prior probabilities P(ωj), the second term in eqn. 3.20 may be eliminated for comparison purposes and we are left with a nearest mean classifier.

Linear discriminant If we further expand the squared distance term, we have,

gj(x) = −(1/(2σ²)) (x^T x − 2 μj^T x + μj^T μj) + log P(ωj),    (3.21)

which may be rewritten as a linear discriminant:

gj(x) = wj0 + wj^T x    (3.22)

where

wj0 = −(1/(2σ²)) (μj^T μj) + log P(ωj),    (3.23)

and

wj = (1/σ²) μj.    (3.24)

Template matching In this latter form the Bayes-Gauss classifier may be seen to be performing template matching or correlation matching, where wj = constant × μj; that is, the prototypical pattern for class j, the mean μj, is the template.

Matched filter In radar and communications systems a matched filter detector is an optimum detector of (subsequence) signals, for example, communication symbols. If the vector x is written as a time series (a digital signal), x[n], n = 0, 1, . . ., then the matched filter for each signal j may be implemented as a convolution:

yj[n] = x[n] ∗ hj[n] = Σ_{m=0}^{N−1} x[n − m] hj[m],    (3.25)

where the kernel hj[.] is a time reversed template; that is, at each time instant, the correlation between hj[.] and the last N samples of x[.] is computed. Provided some threshold is exceeded, the signal achieving the maximum correlation is detected.

Single Layer Neural Network If we restrict the problem to two classes, we can write the classification rule as:

g(x) = g1(x) − g2(x)  (≥ 0 : ω1, otherwise ω2)    (3.26)
     = w0 + w^T x,    (3.27)

where w0 = −(1/(2σ²))(μ1^T μ1 − μ2^T μ2) + log[P(ω1)/P(ω2)] and w = (1/σ²)(μ1 − μ2).

In other words, eqn. 3.27 implements a linear combination, adds a bias, and thresholds the result; that is, a single layer neural network with a hard-limit activation function.

    (Duda & Hart 1973) further demonstrate that eqn. 3.20 implements a hyper-plane partitioning of the feature space.
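For the two-class, equal diagonal covariance case just described, the single-neuron weights of eqns 3.26-3.27 can be computed directly. A minimal sketch (Python/NumPy, not part of the original notes; the means, variance and priors are invented):

```python
import numpy as np

def two_class_weights(mu1, mu2, sigma2, P1, P2):
    """Weights of the single-neuron form of the two-class Bayes-Gauss
    classifier with K_j = sigma^2 I (eqns 3.26-3.27)."""
    w = (mu1 - mu2) / sigma2
    w0 = -(mu1 @ mu1 - mu2 @ mu2) / (2.0 * sigma2) + np.log(P1 / P2)
    return w0, w

def classify(x, w0, w):
    """g(x) = w0 + w^T x >= 0 -> class 1, otherwise class 2."""
    return 1 if w0 + w @ x >= 0 else 2

w0, w = two_class_weights(np.array([4.0, 3.0]), np.array([2.0, 1.5]),
                          sigma2=1.0, P1=0.5, P2=0.5)
print(classify(np.array([3.8, 2.9]), w0, w))   # -> 1
```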


    3.4.2 Equal but General Covariances

When each class has the same covariance matrix, K, eqn. 3.17 reduces to:

gj(x) = −(1/2)(x − μj)^T K⁻¹ (x − μj) + log P(ωj)    (3.28)

Nearest Mean Classifier, Mahalanobis Distance If we have equal prior probabilities P(ωj), we arrive at a nearest mean classifier where the distance calculation is weighted. The Mahalanobis distance (x − μj)^T K⁻¹ (x − μj) effectively weights contributions according to inverse variance. Points of equal Mahalanobis distance correspond to points of equal conditional density p(x | ωj).

Linear Discriminant Eqn. 3.28 may be rewritten as a linear discriminant, see also section 2.5:

gj(x) = wj0 + wj^T x    (3.29)

where

wj0 = −(1/2)(μj^T K⁻¹ μj) + log P(ωj),    (3.30)

and

wj = K⁻¹ μj.    (3.31)

Weighted template matching, matched filter In this latter form the Bayes-Gauss classifier may be seen to be performing weighted template matching.

Single Layer Neural Network As for the diagonal covariance matrix, it can easily be demonstrated that, for two classes, eqns. 3.29-3.31 may be implemented by a single neuron. The only difference from eqn. 3.27 is that the non-bias weights, instead of being simply a difference between means, are now weighted by the inverse of the covariance matrix.

    3.5 Least square error trained classifier

We can formulate the problem of classification as a least-square-error problem. Let us require the classifier to output a class membership indicator ∈ [0, 1] for each class; we can write:

d = f(x)    (3.32)

where d = (d1, d2, . . . , dc)^T is the c-dimensional vector of class indicators and x, as usual, the p-dimensional feature vector. We can express individual class membership indicators as:

dj = b0 + Σ_{i=1}^{p} bi xi + e.    (3.33)

    In order to continue the analysis, we recall the theory of linear regression

    (Beck & Arnold 1977).

    Linear Regression The simplest linear model, y = mx + c, of school mathematics, is given by:

    y = b0 + b1x + e, (3.34)

    which shows the dependence of the dependent variable y on the independent variable x. In other words, y is a linear function

of x and the observation is subject to noise, e; e is assumed to be a zero-mean random process. Strictly, eqn. 3.34 is affine, since b0 is included, but common usage dictates the use of linear. Taking the nth observation of (x, y), we have (Beck & Arnold 1977, p. 133):

    yn = b0 + b1xn + en (3.35)


Least square error estimators for b0 and b1, b̂0 and b̂1, may be obtained from a set of paired observations {xn, yn}_{n=1}^{N} by minimising the sum of squared residuals:

S = Σ_{n=1}^{N} rn² = Σ_{n=1}^{N} (yn − ŷn)²    (3.36)

S = Σ_{n=1}^{N} (yn − b0 − b1xn)²    (3.37)

Minimising with respect to b0 and b1, and replacing these with their estimators, b̂0 and b̂1, gives the familiar result:

b̂1 = [N Σ xn yn − (Σ yn)(Σ xn)] / [N Σ xn² − (Σ xn)²]    (3.38)

b̂0 = (Σ yn)/N − b̂1 (Σ xn)/N    (3.39)

    The validity of these estimates does not depend on the distribution of the errors en; that is, assumption of Gaussianity is not

    essential. On the other hand, all the simplest estimation procedures, including eqns. 3.38 and 3.39, assume the xn to be error

    free, and that the error en is associated with yn.

In the case where y, still one-dimensional, is a function of many independent variables (p of them, in our usual formulation of p-dimensional feature vectors), eqn. 3.35 becomes:

yn = b0 + Σ_{i=1}^{p} bi xin + en    (3.40)

where xin is the i-th component of the n-th feature vector.

Eqn. 3.40 can be written compactly as:

yn = xn^T b + en    (3.41)

where b = (b0, b1, . . . , bp)^T is a (p + 1)-dimensional vector of coefficients, and xn = (1, x1n, x2n, . . . , xpn)^T is the augmented feature vector. The constant 1 in the augmented vector corresponds to the coefficient b0; that is, it is the so called bias term of neural networks, see sections 2.5 and 4.

All N observation equations may now be collected together:

y = Xb + e    (3.42)

where y = (y1, y2, . . . , yn, . . . , yN)^T is the N × 1 vector of observations of the dependent variable, and e = (e1, e2, . . . , en, . . . , eN)^T. X is the N × (p + 1) matrix formed by N rows of p + 1 independent variables.

Now, the sum of squared residuals, eqn. 3.36, becomes:

S = (y − Xb)^T (y − Xb).    (3.43)

Minimising with respect to b, just as eqn. 3.36 was minimised with respect to b0 and b1, leads to a solution for b̂ (Beck & Arnold 1977, p. 235):

b̂ = (X^T X)⁻¹ X^T y.    (3.44)

The jk-th element of the (p + 1) × (p + 1) matrix X^T X is Σ_{n=1}^{N} xnj xnk, in other words, just N times the jk-th element of the autocorrelation matrix, R, of the vector of independent variables x, estimated from the N sample vectors.

If, as in the least-square-error classifier, we have multiple dependent variables (y), in this case c of them, we can replace y in eqn. 3.44 with an appropriate N × c matrix Y formed by N rows each of c observations. Now, eqn. 3.44 becomes:

B = (X^T X)⁻¹ X^T Y    (3.45)

X^T Y is a (p + 1) × c matrix, and B is a (p + 1) × c matrix of coefficients; that is, one column of p + 1 coefficients for each class.

    Eqn. 3.45 defines the training algorithm of our classifier.


Application of the classifier to an (augmented) feature vector x may be expressed as:

y = B^T x.    (3.46)

It remains to find the maximum of the c components of y.
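A minimal sketch of this train-and-apply procedure (Python/NumPy, not part of the original notes; the data are invented). Rather than forming (X^T X)⁻¹ explicitly as in eqn. 3.45, it uses np.linalg.lstsq, which copes better with the conditioning problem discussed below.

```python
import numpy as np

def train_lsq_classifier(X, y, n_classes):
    """Least-square-error training in the spirit of eqn. 3.45."""
    N = X.shape[0]
    Xa = np.hstack([np.ones((N, 1)), X])          # augment with the bias column
    Y = np.eye(n_classes)[y]                      # N x c indicator matrix
    B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)    # (p+1) x c coefficient matrix
    return B

def lsq_classify(x, B):
    """Eqn. 3.46: y = B^T x (augmented x); choose the largest component."""
    xa = np.concatenate([[1.0], x])
    return int(np.argmax(B.T @ xa))

# Hypothetical two-class training data.
X = np.array([[4.0, 3.2], [4.4, 2.9], [3.8, 3.1],
              [2.1, 1.4], [1.9, 1.6], [2.3, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1])
B = train_lsq_classifier(X, y, n_classes=2)
print(lsq_classify(np.array([4.1, 3.0]), B))   # -> 0
```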

In a two-class case, this least-square-error training algorithm yields an identical discriminant to Fisher's linear discriminant (Duda & Hart 1973). Fisher's linear discriminant is described in section A.2.

Eqn. 3.45 has one significant weakness: it depends on the condition of the matrix X^T X. As with any autocorrelation or auto-covariance matrix, this cannot be guaranteed; for example, linearly dependent features will render the matrix singular. In fact, there is an elegant indirect implementation of eqn. 3.45 involving the singular value decomposition (SVD) (Press, Flannery, Teukolsky & Vetterling 1992), (Golub & Van Loan 1989). The Widrow-Hoff iterative gradient-descent training procedure (Widrow & Lehr 1990), developed in the early 1960s, tackles the problem in a different manner.

    3.6 Generalised linear discriminant function

Eqn. 2.13 may be adapted to cope with any function(s) of the features xi; we can define a new feature vector x′ where:

x′k = fk(x).    (3.47)

In the pattern recognition literature, the solution of eqn. 3.47, involving now the vector x′, is called the generalised linear discriminant function (Duda & Hart 1973, Nilsson 1965).

It is desirable to escape from the fixed model of eqn. 3.47: the form of the fk(x) must be known in advance. Multilayer perceptron (MLP) neural networks provide such a solution. We have already shown the correspondence between the linear

    model, eqn. 3.41, and a single layer neural network with a single output node and linear activation function. An MLP with

    appropriate non-linear activation functions, e.g. sigmoid, provides a model-free and arbitrary non-linear solution to learning

    the mapping between x and y (Kosko 1992, Hecht-Nielsen 1990, Haykin 1999).


    Chapter 4

    Neural networks

    Here we show that a single neuron implements a linear discriminant (and hence also implements a separating hyperplane).

    Then we proceed to a discussion which indicates that a neural network comprising three processing layers can implement any

    arbitrarily complex decision region.

Recall eqn. 2.12, with ai → wi, and now (arbitrarily) allocating discriminant value zero to class 0,

g(x) = Σ_{i=1}^{p} wi xi − w0;  ω = 0 when g(x) ≤ 0, ω = 1 when g(x) > 0.    (4.1)

    Figure 4.1 shows a single artificial neuron which implements precisely eqn. 4.1.

Figure 4.1: Single neuron; inputs x1, x2, . . . , xp weighted by w1, w2, . . . , wp, plus a bias input +1 with weight w0.

The signals flowing into the neuron (circle) are weighted; the neuron receives wi xi on input i. The neuron sums and applies a hard limit (output = 1 when sum > 0, otherwise 0). Later we will introduce a sigmoid activation function (softer transition) instead of the hard limit.

The threshold term in the linear discriminant (a0 in eqn. 2.13) is provided by w0 acting on the constant bias input +1. Another interpretation of bias, useful in mathematical analysis of neural networks, see section 3.5, is to represent it by a constant component, +1, as the zeroth component of the augmented feature vector.

Just to reemphasise the linear boundary nature of linear discriminants (and hence neural networks), examine the two-dimensional case,

w1x1 + w2x2 − w0;  ω = 0 when this is ≤ 0, ω = 1 when it is > 0.    (4.2)

The boundary between classes, given by w1x1 + w2x2 − w0 = 0, is a straight line with x1-axis intercept at w0/w1 and x2-axis intercept at w0/w2, see Figure 4.2.


Figure 4.2: Separating line implemented by a two-input neuron, with intercepts w0/w1 and w0/w2 on the x1 and x2 axes.


    4.1 Neurons for Boolean Functions

A neuron with weights w0 = 0.5 and w1 = w2 = 0.35 implements a Boolean AND:

x1  x2  AND(x1,x2)  Neuron summation: -0.5 + 0.35 x1 + 0.35 x2   Hard-limit (> 0?)
--  --  ----------  ------------------------------------------   -----------------
0   0   0           sum = -0.5                                    output = 0
1   0   0           sum = -0.15                                   output = 0
0   1   0           sum = -0.15                                   output = 0
1   1   1           sum = +0.2                                    output = 1

Similarly, a neuron with weights w0 = 0.25 and w1 = w2 = 0.35 implements a Boolean OR.
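The following minimal sketch (Python, not part of the original notes) implements the hard-limit neuron of eqn. 4.1 and checks the AND and OR weights quoted above.

```python
def neuron(x, w, w0):
    """Single hard-limit neuron, eqn. 4.1: output 1 if sum(w_i x_i) - w0 > 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - w0
    return 1 if s > 0 else 0

AND = lambda x1, x2: neuron((x1, x2), (0.35, 0.35), 0.5)
OR = lambda x1, x2: neuron((x1, x2), (0.35, 0.35), 0.25)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, AND(x1, x2), OR(x1, x2))
```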

    Figure 4.3 shows the x1-x2-plane representation of AND, OR, and XOR (exclusive-or).

Figure 4.3: AND, OR, XOR in the x1-x2 plane.

It is noted that XOR cannot be implemented by a single neuron; in fact it requires two layers. Two layers were a big problem in the first wave of neural network research in the 1960s, when it was not known how to train more than one layer.

    4.2 Three-layer neural network for arbitrarily complex decision regions

    The purpose of this section is to give an intuitive argument as to why three processing layers can implement an arbitrarily

    complex decision region.

    Figure 4.4 shows such a decision region in two-dimensions.

As shown in the figure, however, each island of class 1 may be delineated using a series of boundaries, d11, d12, d13, d14 and d21, d22, d23, d24.

    Figure 4.5 shows a three-layer network which can implement this decision region.

    First, just as before, input neurons implement separating lines (hyperplanes), d11, etc. Next, in layer 2, we AND together the

decisions from the separating hyperplanes to obtain the decisions "in island 1" and "in island 2". Finally, in the output layer, we OR

    together the latter decisions; thus we can construct an arbitrarily complex partitioning.

    Of course, this is merely an intuitive argument. A three layer neural network trained with back-propagation or some other

    technique might well achieve the partitioning in quite a different manner.

    4.3 Sigmoid activation functions

If a neural network is to be trained using backpropagation or a similar technique, hard limit activation functions cause problems (associated with differentiation). Sigmoid activation functions are used instead. A sigmoid activation function corresponding to the hard limit progresses from output value 0 at −∞, passes through 0 with value 0.5, and flattens out at value 1 at +∞.
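The notes do not give a formula, but the standard choice with exactly this behaviour is the logistic sigmoid σ(a) = 1 / (1 + e^(−a)); a minimal sketch:

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: tends to 0 as a -> -inf, is 0.5 at a = 0, and tends
    to 1 as a -> +inf; differentiable, unlike the hard limit."""
    return 1.0 / (1.0 + np.exp(-a))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # -> approximately [0.  0.5  1.]
```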


Figure 4.4: Complex decision region required; two islands of class 1 amid class 0 data, delineated by boundaries d11, d12, d13, d14 and d21, d22, d23, d24.


Figure 4.5: Three-layer neural network implementing an arbitrarily complex decision region; the first layer implements the boundaries d11, . . . , d14 and d21, . . . , d24, the second layer ANDs them into per-island decisions, and the output layer ORs these into the class decision.


    Bibliography

    Beck, J. & Arnold, K. (1977). Parameter Estimation in Engineering and Science, John Wiley & Sons, New York.

Campbell, J. (2000). Fuzzy Logic and Neural Network Techniques in Data Analysis, PhD thesis, University of Ulster.

Campbell, J. & Murtagh, F. (1998). Image processing and pattern recognition, Technical report, Computer Science, Queen's University Belfast. Available at: http://www.jgcampbell.com/ip.

    Duda, R. & Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley-Interscience, New York.

    Duda, R., Hart, P. & Stork, D. (2000). Pattern Classification, Wiley-Interscience.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics 7: 179-188.

    Foley, J., van Dam, A., Feiner, S., Hughes, J. & Phillips, R. (1994). Introduction to Computer Graphics, Addison Wesley.

    Golub, G. & Van Loan, C. (1989). Matrix Computations, 2nd edn, Johns Hopkins University Press, Baltimore.

    Haykin, S. (1999). Neural Networks - a comprehensive foundation, 2nd edn, Prentice Hall.

    Hecht-Nielsen, R. (1990). Neurocomputing, Addison-Wesley, Redwood City, CA.

    Kosko, B. (1992). Neural Networks and Fuzzy Systems. A Dynamical Systems Approach to Machine Intelligence, Prentice

    Hall, Englewood Cliffs.

    Nilsson, N. (1965). Learning Machines, McGraw-Hill, New York.

    Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992). Numerical Recipes in C, 2nd edn, Cambridge University

    Press, Cambridge, UK.

    Sivia, D. (1996). Data Analysis, A Bayesian Tutorial, Oxford University Press, Oxford, U.K.

    Venables, W. & Ripley, B. (2002). Modern Applied Statistics with S, 4th edn, Springer-Verlag.

Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks, Proc. IEEE 78(9): 1415-1442.


    Appendix A

Principal Components Analysis and Fisher's Linear Discriminant Analysis

    A.1 Principal Components Analysis

Principal component analysis (PCA), also called the Karhunen-Loeve transform (Duda, Hart & Stork 2000), is a linear transformation which maps a p-dimensional feature vector x ∈ R^p to another vector y ∈ R^p, where the transformation is optimised such that the components of y contain maximum information in a least-square-error sense. In other words, if we take the first q ≤ p components (y ∈ R^q), then, using the inverse transformation, we can reproduce x with minimum error. Yet another view is that the first few components of y contain most of the variance, that is, in those components, the transformation stretches the data maximally apart. It is this that makes PCA good for visualisation of the data in two dimensions, i.e. the first two principal components give an optimum view of the spread of the data.

We note, however, that unlike linear discriminant analysis, see section A.2, PCA does not take account of class labels. Hence it is typically a more useful visualisation of the inherent variability of the data.

In general x can be represented, without error, by the following expansion:

x = Uy = Σ_{i=1}^{p} yi ui    (A.1)

where

yi is the ith component of y, and    (A.2)

where

U = (u1, u2, . . . , up)    (A.3)

is an orthonormal matrix:

uj^T uk = δjk = 1 when j = k; otherwise 0.    (A.4)

If we truncate the expansion at i = q,

x̂ = Uq y = Σ_{i=1}^{q} yi ui,    (A.5)

we obtain a least square error approximation of x, i.e.

|x − x̂| = minimum.    (A.6)


The optimum transformation matrix U turns out to be the eigenvector matrix of the sample covariance matrix C:

C = (1/N) A^T A,    (A.7)

where A is the N × p sample matrix. Then

U^T C U = Λ,    (A.8)

the diagonal matrix of eigenvalues.
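A minimal sketch of PCA via the eigendecomposition of the sample covariance (Python/NumPy, not part of the original notes; the data are invented, and the samples are mean-centred first, a step the summary above leaves implicit):

```python
import numpy as np

def pca(X, q):
    """Project each row of X onto its first q principal components (eqns A.5, A.7-A.8)."""
    A = X - X.mean(axis=0)                      # mean-centre the N x p sample matrix
    C = (A.T @ A) / A.shape[0]                  # sample covariance, eqn A.7
    eigvals, U = np.linalg.eigh(C)              # columns of U are eigenvectors of C
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    Uq = U[:, order[:q]]
    return A @ Uq                               # y = U_q^T x for each sample

# Hypothetical 3-D data projected onto the first two principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.2])
Y = pca(X, q=2)
print(Y.shape)   # -> (100, 2)
```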

A.2 Fisher's Linear Discriminant Analysis

In contrast with PCA (see section A.1), linear discriminant analysis (LDA) transforms the data to provide optimal class separability (Duda et al. 2000), (Fisher 1936).

Fisher's original LDA, for two-class data, is obtained as follows. We introduce a linear discriminant u (a p-dimensional vector of weights; the weights are very similar to the weights used in neural networks) which, via a dot product, maps a feature vector x to a scalar,

y = u^T x.    (A.9)

u is optimised to maximise simultaneously (a) the separability of the classes (between-class separability), and (b) the clustering together of same-class data (within-class clustering). Mathematically, this criterion can be expressed as:

J(u) = (u^T SB u) / (u^T SW u),    (A.10)

where SB is the between-class covariance,

SB = (m1 − m2)(m1 − m2)^T,    (A.11)

and

SW = C1 + C2,    (A.12)

the sum of the class-conditional covariance matrices, see section A.1.

m1 and m2 are the class means.

u is now computed as:

u = SW⁻¹ (m1 − m2).    (A.13)
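A minimal sketch of the two-class Fisher discriminant of eqn A.13 (Python/NumPy, not part of the original notes; the data are invented):

```python
import numpy as np

def fisher_lda(X1, X2):
    """Fisher's two-class linear discriminant, eqn A.13: u = S_W^{-1}(m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # eqn A.12
    return np.linalg.solve(Sw, m1 - m2)     # solve rather than invert explicitly

# Hypothetical two-class data; project onto u (eqn A.9) to get one scalar per sample.
rng = np.random.default_rng(1)
X1 = rng.normal(loc=[4.0, 3.0], scale=0.5, size=(50, 2))
X2 = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(50, 2))
u = fisher_lda(X1, X2)
y1, y2 = X1 @ u, X2 @ u                  # projections y = u^T x
print(u, y1.mean() > y2.mean())          # class 1 projects to larger values -> True
```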

    There are other formulations of LDA (Duda et al. 2000) (Venables & Ripley 2002), particularly extensions from two-class to

    multi-class data.

    In addition, there are extensions (Duda et al. 2000) (Venables & Ripley 2002) which form a second discriminant, orthogonal

    to the first, which optimises the separability and clustering criteria, subject to the orthogonality constraint. The second

dimension/discriminant is useful to allow the data to be viewed as a two-dimensional scatter plot.