
    Dipartimento di Ingegneria Biofisica ed Elettronica, Università di Genova
    Prof. Sebastiano B. Serpico

    4. Feature Reduction


    Complexity of a Classifier

    • Increasing the number n of features, the design of the classifier presents different issues connected to the dimensionality of the problem ("curse of dimensionality"):

    – Computational complexity;

    – Hughes phenomenon.

    • Computational complexity 

    – Increasing n, the computational complexity of a classifier increases. For some classification techniques this increase is linear with n, for others it is of a higher order (e.g., quadratic).

    – The increase in complexity leads to longer computation times and larger memory occupation.


    Hughes Phenomenon

    • Intuitive reasoning

    – Increasing n, the amount of information available to the classifier should increase, and consequently also the classification accuracy, but...

    • Experimental observation 

    – ... on the contrary, for a fixed number N of training samples, the probability of a correct decision of a classifier increases for 1 ≤ n ≤ n* up to a maximum and decreases for n > n* (Hughes phenomenon).

    • Interpretation 

    – Increasing n, the number of parameters K_n of the classifier becomes higher and higher.

    – As the ratio K_n/N increases, the number of available training samples becomes "too small" to obtain a satisfactory estimate of such parameters.


    Feature Reduction

    • A solution to these dimensionality issues is to reduce the number n of the features used in the classification process (feature reduction or parameter reduction).

    • Disadvantage: reducing the dimension of the feature space involves a loss of information.

    • Two main strategies exist to achieve feature reduction:

    – feature selection: within the set of the n available features, a subset of m features is identified by adopting an optimization criterion chosen to minimize the loss of information or to maximize the classification accuracy;

    – feature extraction: a transformation (often linear) of the original n-dimensional feature space into a space of smaller dimension m is applied so as to minimize the information loss or to maximize the classification accuracy.


    Feature Selection

    • Problem setting:

    – Given a set X = {x1, x2, …, xn} of n features, identify the subset S* ⊂ X, composed of m features (m < n), that maximizes a functional J(·):

      $S^* = \arg\max_{S \subset X} J(S)$

    • An algorithm for feature selection is then defined on the basis of two distinct objects:

    – the functional J(·). It has to be defined such that J(S) measures the "goodness" of the feature subset S in the classification process;

    – the algorithm for the search of the subset S*. The subsets of X are in fact 2^n, so an exhaustive search is computationally not feasible, except for small values of n. Therefore, sub-optimal maximization strategies are adopted to detect "good" solutions, even if they do not correspond to global optima.


    Bhattacharyya Bounds

    • A choice of the functional J(·), which is significant from the classification point of view, can be based on the criterion of the minimum error probability Pe.

    – In the presence of two classes ω1 and ω2 only, the Bhattacharyya distance B and the Bhattacharyya coefficient ρ provide an upper bound on Pe:

      $P_e \le \sqrt{P_1 P_2}\,\rho = \sqrt{P_1 P_2}\,e^{-B}$, where $B = -\ln\rho$ and $\rho = \int_{\mathbb{R}^n} \sqrt{p(\mathbf{x}|\omega_1)\,p(\mathbf{x}|\omega_2)}\,d\mathbf{x}$

    – Moreover, it is possible to demonstrate that:

      $\tfrac{1}{2}\left[1 - \sqrt{1 - 4 P_1 P_2 \rho^2}\right] \le P_e \le \sqrt{P_1 P_2}\,\rho$
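    As a quick numerical illustration of these bounds (an addition to the slides, not part of the original material), the following Python sketch assumes two equiprobable 1-D Gaussian classes, estimates ρ by a simple Riemann sum, and compares the resulting bounds with the exact Bayes error for this case:

```python
import numpy as np
from math import erfc, log, sqrt

# Two equiprobable 1-D Gaussian classes (illustrative values)
mu1, mu2, sigma = 0.0, 2.0, 1.0
P1 = P2 = 0.5

# Bhattacharyya coefficient: rho = integral of sqrt(p(x|w1) p(x|w2)) dx
x = np.linspace(-10.0, 12.0, 200001)
dx = x[1] - x[0]
p1 = np.exp(-0.5 * ((x - mu1) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
p2 = np.exp(-0.5 * ((x - mu2) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
rho = np.sum(np.sqrt(p1 * p2)) * dx
B = -log(rho)

# Exact Bayes error for equal priors and equal variances, plus the two bounds
Pe = 0.5 * erfc(abs(mu2 - mu1) / (2 * sigma * sqrt(2)))
upper = sqrt(P1 * P2) * rho
lower = 0.5 * (1 - sqrt(1 - 4 * P1 * P2 * rho ** 2))
print(f"B = {B:.4f};  {lower:.4f} <= Pe = {Pe:.4f} <= {upper:.4f}")
```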


    Bhattacharyya Distance and Coefficient

    • An approach to feature selection consists in the maximization of the Bhattacharyya distance B or (equivalently) in the minimization of the Bhattacharyya coefficient ρ.

    – In particular, a distance B(S) (or a coefficient ρ(S)) can be associated with each subset S of m features; in fact, denoting by xS a feature vector restricted to the subset S, one can define:

      $B(S) = -\ln\rho(S)$, where $\rho(S) = \int_{\mathbb{R}^m} \sqrt{p(\mathbf{x}_S|\omega_1)\,p(\mathbf{x}_S|\omega_2)}\,d\mathbf{x}_S$

    • Properties:

    – 0 ≤ ρ(S) ≤ 1 and then B(S) ≥ 0;

    – if p(xS|ω1) and p(xS|ω2) are different from zero only in separated regions, then ρ(S) = 0 and B(S) = +∞;

    – if p(xS|ω1) = p(xS|ω2) for any xS, then ρ(S) is the integral of a pdf over the entire space ℝ^m, therefore ρ(S) = 1 and B(S) = 0.

    Computation of the Bhattacharyya Coefficient and Distance

    • ρ(S) is a multiple integral in an m-dimensional space, so its analytical computation starting from the conditional pdfs is complex. Two particular cases in which the computation is simple exist:

    – if the features in the subset S are independent, when conditioned to each class, we have:

      $p(\mathbf{x}_S|\omega_i) = \prod_{x_r \in S} p(x_r|\omega_i), \qquad i = 1, 2$

      then the following property is valid (additive property):

      $\rho(S) = \prod_{x_r \in S} \rho(\{x_r\}), \qquad B(S) = \sum_{x_r \in S} B(\{x_r\})$

    – if $p(\mathbf{x}_S|\omega_i) = \mathcal{N}(\mathbf{m}_i^S, \Sigma_i^S)$ (i = 1, 2), we obtain:

      $B(S) = \frac{1}{8}(\mathbf{m}_2^S - \mathbf{m}_1^S)^t \left[\frac{\Sigma_1^S + \Sigma_2^S}{2}\right]^{-1} (\mathbf{m}_2^S - \mathbf{m}_1^S) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1^S + \Sigma_2^S}{2}\right|}{\sqrt{|\Sigma_1^S|\,|\Sigma_2^S|}}$
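    The Gaussian closed form above translates directly into a few lines of NumPy. The following is a minimal illustrative sketch (function and variable names are arbitrary, not from the slides):

```python
import numpy as np

def bhattacharyya_gaussian(m1, m2, S1, S2):
    """Bhattacharyya distance between two Gaussian classes N(m1, S1) and N(m2, S2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    S1, S2 = np.atleast_2d(S1).astype(float), np.atleast_2d(S2).astype(float)
    Sm = 0.5 * (S1 + S2)                              # (Sigma1 + Sigma2) / 2
    d = m2 - m1
    term_means = 0.125 * d @ np.linalg.solve(Sm, d)   # (1/8) d^t Sm^{-1} d
    term_covs = 0.5 * np.log(np.linalg.det(Sm)
                             / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_means + term_covs

# Example with two 2-D Gaussian classes
B = bhattacharyya_gaussian([0, 0], [2, 1], np.eye(2), [[2.0, 0.3], [0.3, 1.0]])
rho = np.exp(-B)
print(f"B = {B:.3f}, rho = {rho:.3f}")
```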


    Other Inter-Class Distances

    • In addition to the Bhattacharyya distance, different measures of inter-class distance have been introduced in the literature.

    – For example, the Divergence measures the separation between two classes as a function of the likelihood ratio between the respective conditional pdfs.

    – The Bhattacharyya distance and the Divergence are not upper bounded. This makes them less appropriate as measures of inter-class separation. In fact, focusing on the Gaussian case for simplicity, when two classes are already well separated, an increase of the distance between the conditional means m1^S and m2^S generates a "large" increase of B(S), but a negligible reduction of Pe.

    – Therefore, other measures of inter-class distance have been proposed (not treated in depth here) that, being upper bounded, do not present such a problem. Among them, we recall the Jeffries-Matusita distance and the Modified Divergence [Richards 1999, Swain 1978].


    Multiclass Extension

    • Extension to the case of M classes ω1, ω2, …, ωM.

    – If ρij(S) and Bij(S) are the Bhattacharyya coefficient and distance between two classes ωi and ωj computed over a feature subset S, and if Pi = P(ωi) is the a priori probability of the class ωi, the following average Bhattacharyya coefficient and average Bhattacharyya distance are defined:

      $\rho_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\,\rho_{ij}(S), \qquad B_{ave}(S) = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} P_i P_j\,B_{ij}(S)$

    • Remarks

    – In the case M = 2, the maximization of B(S) was equivalent to the minimization of ρ(S) (because B(S) = −ln ρ(S)). In the multiclass case the maximization of Bave(S) is no longer equivalent to the minimization of ρave(S), because the relation between Bave(S) and ρave(S) is no longer monotonic.

    – Under the hypothesis of class-conditional feature independence, we have (attention: the analogous property is not valid for ρave(S)):

      $B_{ave}(S) = \sum_{x_r \in S} B_{ave}(\{x_r\})$
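    A hedged sketch of the multiclass average for Gaussian classes, reusing the bhattacharyya_gaussian function from the previous sketch; means, covs, priors and feat_idx are assumed inputs (lists of class means, covariances, priors, and the indices of the selected features):

```python
from itertools import combinations
import numpy as np

def b_ave(means, covs, priors, feat_idx):
    """Average Bhattacharyya distance over the feature subset feat_idx,
    for M Gaussian classes described by (means, covs, priors)."""
    idx = np.asarray(feat_idx, dtype=int)
    total = 0.0
    for i, j in combinations(range(len(means)), 2):
        mi = np.asarray(means[i], float)[idx]
        mj = np.asarray(means[j], float)[idx]
        Si = np.asarray(covs[i], float)[np.ix_(idx, idx)]
        Sj = np.asarray(covs[j], float)[np.ix_(idx, idx)]
        total += priors[i] * priors[j] * bhattacharyya_gaussian(mi, mj, Si, Sj)
    return total
```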


    Maximization of the Functional

    • In a feature selection problem, the introduced measures of inter-class separation are used in the role of the functional J(·) to be maximized.

    • Preliminary observations:

    – The Bhattacharyya distance has to be maximized, while the Bhattacharyya coefficient has to be minimized. Therefore, in the following, J(S) may correspond to Bave(S) or to −ρave(S).

    – An exhaustive search over all possible subsets of X is, in general, computationally not affordable.

    – It is feasible if the features are independent, when conditioned to each class, and if the adopted functional is Bave. In such a case, in fact, once the values of the functional associated with the single features have been computed, by the additive property the optimum subset S* of m features is simply composed of the m features that individually present the highest values of Bave({xr}); see the sketch below.
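    Under the independence assumption, the optimal selection therefore reduces to a ranking. A small illustrative sketch, building on the b_ave function above:

```python
def select_by_individual_bave(means, covs, priors, m):
    """Optimal subset under class-conditional independence: rank each feature
    by its individual average Bhattacharyya distance B_ave({x_r})."""
    n = len(means[0])
    scores = [b_ave(means, covs, priors, [r]) for r in range(n)]
    ranking = sorted(range(n), key=lambda r: scores[r], reverse=True)
    return ranking[:m]
```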

    Sequential Forward Selection

    • In general, the search for a subset of m features is conducted by means of a sub-optimal algorithm. Among such algorithms we consider (for its simplicity) the sequential forward selection (SFS), which is based on the following steps (a code sketch follows below):

    – initialize S* = ∅;

    – compute the value of the functional J(S* ∪ {xi}) for all the subsets S* ∪ {xi}, with xi ∉ S*, and choose the feature x* ∉ S* that corresponds to the maximum value of J(S* ∪ {xi});

    – update S* by setting S* = S* ∪ {x*};

    – continue by iteratively adding one feature at a time until S* reaches the desired cardinality m or until the value of the functional stabilizes (reaches saturation).
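    A minimal Python sketch of SFS; J stands for any callable implementing the chosen functional (for instance a wrapper around b_ave above), and the saturation-based stopping rule is omitted for brevity:

```python
def sfs(features, J, m):
    """Sequential forward selection: greedily add, one at a time, the feature
    that maximizes the functional J evaluated on the enlarged subset."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < m:
        best = max(remaining, key=lambda x: J(selected + [x]))   # J(S* U {x_i})
        selected.append(best)
        remaining.remove(best)
    return selected

# Example of use with the Gaussian functional above:
# J = lambda S: b_ave(means, covs, priors, S)
# S_star = sfs(range(n), J, m=10)
```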


    Remarks on SFS

    • SFS identifies the optimum subset that can be obtained by iteratively adding a single feature at a time.

    – At the first step, the single feature that corresponds to the maximum value of the functional is chosen. At the second step, the feature that, coupled with the previous one, provides the maximum value of the functional is added. And so on...

    – The method is sub-optimal. For example, the optimal couple of features does not always include the single optimal feature.

    • Advantage

    – SFS is not computationally heavy even if X contains hundreds of features.

    • Disadvantage

    – A feature that has been included in the selected subset S* at a specific iteration cannot be removed during the following iterations; this means that SFS does not allow backtracking.

    Sequential Backward Selection

    • Sequential backward selection (SBS) proceeds in a dual way with respect to SFS, initializing S* = X and eliminating a single feature at a time from S*, so as to maximize the functional J(S) step by step (see the sketch below).

    • Disadvantages

    – Like SFS, SBS, too, does not allow backtracking: a feature eliminated from S* at a specific iteration will never be recovered in the following steps;

    – Usually SBS is computationally disadvantageous with respect to SFS: while SFS starts from an empty subset and adds one feature at a time, SBS starts from the original feature space. Therefore, SBS computes values of the functional in spaces with much higher dimensions than SFS. However, it is advantageous if m is close to n.

    • In the literature, other more complex methods have been proposed (not covered here) to search for suboptimal subsets, which also allow backtracking [Serpico et al., 2001].
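    For completeness, a sketch of SBS dual to the SFS one, under the same assumptions (J is any callable implementing the chosen functional):

```python
def sbs(features, J, m):
    """Sequential backward selection: start from the full set X and greedily
    remove, one at a time, the feature whose removal keeps J highest."""
    selected = list(features)
    while len(selected) > m:
        # J(S* \ {x_i}) for every candidate removal
        worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
        selected.remove(worst)
    return selected
```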


    Operational Aspects of Feature Selection

    • The computation of inter-class distance measures, used in feature selection, requires the knowledge of the class-conditional pdfs and of the class prior probabilities.

    – Usually, such pdfs are not known a priori but have to be estimated from a training set, by means of parametric or non-parametric methods.

    – Globally, a classification system that involves a feature selection step can be summarized by the following flowchart: the training samples for each class ωi are used to estimate the class-conditional pdfs and class prior probabilities {p(x|ωi), Pi}; these feed the feature selection step, which outputs S*; the classifier is then trained on S* and applied to the data set, producing the classification of the data set.


    Example

    • Hyperspectral data set with 202 features and 9 classes. 

    [Figures: RGB composition of three of the 202 bands acquired by the sensor; map of the ground truth highlighting the training pixels; plot of Bave versus the number m of selected features; estimated probability of correct classification (OA) versus m for a MAP classifier under the hypothesis of Gaussian classes, with Pc,max = 88.6% for m = 40.]


    Feature Extraction

    • Problem definition:

    – Given a set X = {x1, x2, …, xn} of n features, we want to identify a linear transformation that provides a transformed set of m features Y = {y1, y2, …, ym} (with m < n), i.e., y = T·x, where T is an m × n transformation matrix.


    Extraction Based on Inter-Class Distances

    • Considering again the Bhattacharyya distance, in the case of two Gaussian classes, we look for the orthonormal feature transformation that maximizes the distance in the transformed space.

    – Let miY = E{y|ωi} = T·mi and ΣiY = Cov{y|ωi} = T·Σi·T^t (for i = 1, 2); B in the transformed space Y is given by:

      $B(Y) = \underbrace{\frac{1}{8}(\mathbf{m}_2^Y - \mathbf{m}_1^Y)^t \left[\frac{\Sigma_1^Y + \Sigma_2^Y}{2}\right]^{-1} (\mathbf{m}_2^Y - \mathbf{m}_1^Y)}_{B_m(Y)} + \underbrace{\frac{1}{2}\ln\frac{\left|\frac{\Sigma_1^Y + \Sigma_2^Y}{2}\right|}{\sqrt{|\Sigma_1^Y|\,|\Sigma_2^Y|}}}_{B_\Sigma(Y)}$

    • In the expression of B, two distinct contributions Bm(Y) and BΣ(Y) appear, respectively linked to the conditional means and to the conditional covariance matrices.


    Extraction Based on Inter-Class Distances

    • In principle, we would search for the orthogonal matrix T that would maximize B(Y). However:

    – The general problem of the maximization of B(Y) with respect to T has no closed-form solution.

    – The problems of separately maximizing Bm(Y) or BΣ(Y) have closed-form solutions (eigenproblems). Details can be found in [Fukunaga, 1990].

    – Therefore, if one of the two contributions is largely dominant over the other (i.e., Bm(Y) >> BΣ(Y) or Bm(Y) << BΣ(Y)), the dominant contribution can be maximized in closed form and the resulting transformation adopted as an approximate solution.


    Linear discriminant analysis

    • A popular method for feature extraction is linear discriminant analysis (LDA, aka discriminant analysis feature extraction, DAFE), which maximizes a measure of separation and compactness of the classes directly defined on the training set.

    – Although explicit parametric assumptions are not stated, DAFE is usually considered parametric, because it "works poorly," for example, with multimodal classes, and it characterizes the classes only through first- and second-order moments.

    – However, nonparametric extensions of this method have recently been introduced.

    • The method can be applied to both binary and multiclass problems.

    – Focusing first on the case of two classes, ω1 and ω2, linear discriminant analysis provides an optimum scalar projection, named the Fisher transform.


    DAFE: Fisher transform

    • In general, although the classes are well separated in the original n-dimensional space, they may not be such in a transformed one-dimensional space, because the projection can overlay samples drawn from different classes.

    • The problem is to find the orientation of the projection line that provides the best separation between the two classes.

    – Given a set {x1, x2, …, xN} of N pre-classified samples, let Di be the subset of the samples assigned to ωi (i = 1, 2) and let Ni be the cardinality of Di (obviously N = N1 + N2).

    – A transformation y = w^t x projects the sample xk to yk = w^t xk. Let Ei = {y = w^t x: x ∈ Di}.

    – We search for the transformation y = w^t x that maximizes the inter-class separation and minimizes the intra-class dispersion, conveniently quantified.


    Inter-class separation and intra-class dispersion

    • First, a functional that measures the inter-class separation and the dispersion inside each class is necessary.

    – As a measure of inter-class separation, the difference between the centroids of the samples in the transformed space is used:

      $\mu_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in E_i} y = \mathbf{w}^t\mu_i, \qquad i = 1, 2$

    – As a measure of class dispersion around the centroids, the scatter values in the transformed space are adopted, i.e.:

      $S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mu_i)(\mathbf{x}-\mu_i)^t, \qquad s_i^2 = \sum_{y\in E_i}(y-\tilde{\mu}_i)^2 = \mathbf{w}^t S_i\,\mathbf{w}, \qquad i = 1, 2$

      Si is called the scatter matrix of the class ωi (i = 1, 2).


    The Fisher Functional

    • The goal of the Fisher transform is to maximize the distance between the centroids of the classes and to minimize the scatters in the one-dimensional transformed space.

    – For this purpose, the following Fisher functional is introduced:

      $J(\mathbf{w}) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{s_1^2 + s_2^2}$

    – Let us explicitly write the functional as a function of w:

      $s_1^2 + s_2^2 = \mathbf{w}^t S_1\mathbf{w} + \mathbf{w}^t S_2\mathbf{w} = \mathbf{w}^t S_w\mathbf{w}, \qquad (\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \mathbf{w}^t(\mu_1 - \mu_2)(\mu_1 - \mu_2)^t\mathbf{w} = \mathbf{w}^t S_b\mathbf{w}$

      $J(\mathbf{w}) = \frac{\mathbf{w}^t S_b\,\mathbf{w}}{\mathbf{w}^t S_w\,\mathbf{w}}$

      where Sb = (μ1 − μ2)(μ1 − μ2)^t is named the between-class scatter matrix and Sw = S1 + S2 is named the within-class scatter matrix.


    Optimality condition for the Fisher functional

    • Optimality condition

    – Through the usual zero-gradient condition, one may prove that the vector w* that maximizes the Fisher functional is an eigenvector of the product matrix Sw^{-1}Sb:

      $(S_w^{-1}S_b - \lambda I)\,\mathbf{w}^* = \mathbf{0}, \quad \text{i.e.,} \quad (S_b - \lambda S_w)\,\mathbf{w}^* = \mathbf{0}$

      where λ is the corresponding eigenvalue.

    • Closed-form solution

    – Therefore, w* satisfies the condition:

      $S_w^{-1}S_b\,\mathbf{w}^* = S_w^{-1}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^t\,\mathbf{w}^* = \lambda\,\mathbf{w}^*$

    – (μ1 − μ2)^t w* and λ are scalars, so w* is parallel to Sw^{-1}(μ1 − μ2). Since scale factors are irrelevant in linear projections, we obtain the following closed-form solution (with no need for explicitly computing eigenvectors):

      $\mathbf{w}^* = S_w^{-1}(\mu_1 - \mu_2)$

    – Typically, the vector w* is also normalized.
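    The closed-form Fisher direction is easy to compute with NumPy. The following illustrative sketch (hypothetical names, not from the slides) builds Sw from the two training subsets and returns the normalized w*:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Closed-form Fisher transform for two classes.
    X1, X2: arrays of shape (N1, n) and (N2, n) with the training samples.
    Returns the normalized vector w* proportional to Sw^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
    Sw = S1 + S2                          # within-class scatter matrix
    w = np.linalg.solve(Sw, mu1 - mu2)    # no explicit eigen-decomposition needed
    return w / np.linalg.norm(w)

# Scalar projection of a sample x onto the Fisher axis: y = w @ x
```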


    DAFE: multiclass Fisher transform

    • We extend the discriminant analysis from the binary case to the case of M classes ω1, ω2, …, ωM and of an m × n transformation matrix.

    – Let us consider a set {x1, x2, …, xN} of N pre-classified samples, denote as Di the subset of the samples assigned to ωi (i = 1, 2, …, M) and as Ni the cardinality of Di (N = N1 + N2 + … + NM).

    – The transformation y = T x maps xk to yk = T xk. Given Ei = {y = T x: x ∈ Di}, let us define:

      Centroids of ωi in the original and transformed spaces:

      $\mu_i = \frac{1}{N_i}\sum_{\mathbf{x}\in D_i}\mathbf{x}, \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{\mathbf{y}\in E_i}\mathbf{y} = T\mu_i, \qquad i = 1, 2, \ldots, M$

      Scatter matrices of ωi in the original and transformed spaces:

      $S_i = \sum_{\mathbf{x}\in D_i}(\mathbf{x}-\mu_i)(\mathbf{x}-\mu_i)^t, \qquad \tilde{S}_i = \sum_{\mathbf{y}\in E_i}(\mathbf{y}-\tilde{\mu}_i)(\mathbf{y}-\tilde{\mu}_i)^t = T S_i T^t, \qquad i = 1, 2, \ldots, M$


    DAFE: multiclass Fisher functional (1)

    • Let us extend the Fisher functional to the multiclass case.

    – In the multiclass case, we quantify inter-class separation through the differences between the centroids of the classes and the centroid of the entire training set in the transformed space:

      $\tilde{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{y}_k = T\mu, \quad \text{where } \mu = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k = \frac{1}{N}\sum_{i=1}^{M} N_i\,\mu_i$

    – We measure the dispersions inside the single classes by means of the scatter matrices in the transformed space.

    – Then, the Fisher functional is generalized as follows:

      $J(T) = \frac{\sum_{i=1}^{M} N_i\,(\tilde{\mu}_i - \tilde{\mu})^t(\tilde{\mu}_i - \tilde{\mu})}{\sum_{i=1}^{M} \mathrm{tr}(\tilde{S}_i)}$


    DAFE: multiclass Fisher functional (2)

    • Let us explicitly write the Fisher functional as a function of the unknown transformation matrix T.

    – Let us express numerator and denominator as functions of T and consequently introduce a within-class scatter matrix Sw and a between-class scatter matrix Sb:

      $\sum_{i=1}^{M}\mathrm{tr}(\tilde{S}_i) = \sum_{i=1}^{M}\mathrm{tr}(T S_i T^t) = \mathrm{tr}(T S_w T^t), \quad \text{where } S_w = \sum_{i=1}^{M} S_i$

      $\sum_{i=1}^{M} N_i\,(\tilde{\mu}_i - \tilde{\mu})^t(\tilde{\mu}_i - \tilde{\mu}) = \sum_{i=1}^{M} N_i\,(T\mu_i - T\mu)^t(T\mu_i - T\mu) = \mathrm{tr}(T S_b T^t), \quad \text{where } S_b = \sum_{i=1}^{M} N_i\,(\mu_i - \mu)(\mu_i - \mu)^t$

      $J(T) = \frac{\mathrm{tr}(T S_b T^t)}{\mathrm{tr}(T S_w T^t)}$


    Optimality condition for the multiclass case

    • Optimality condition

    – Again through a zero-gradient condition, one may prove that the row vectors e1, e2, …, em of the matrix T* that maximizes the Fisher functional are eigenvectors of Sw^{-1}Sb:

      $(S_w^{-1}S_b - \lambda_i I)\,\mathbf{e}_i = \mathbf{0}, \quad \text{i.e.,} \quad (S_b - \lambda_i S_w)\,\mathbf{e}_i = \mathbf{0}, \qquad i = 1, 2, \ldots, m$

      where λi is the eigenvalue corresponding to ei and is nonzero.

    • Remarks

    – The M matrices (μi − μ)(μi − μ)^t, i = 1, 2, …, M, have unit rank. Because of the linear relationship among the overall centroid μ and the class centroids μi, i = 1, 2, …, M, they are also linearly dependent.

    – Thus, rank(Sb) ≤ M − 1 and, then, rank(Sw^{-1}Sb) ≤ rank(Sb) ≤ M − 1.

    – Therefore, at most (M − 1) eigenvalues of Sw^{-1}Sb are nonzero, i.e., the eigenvector equation provides at most (M − 1) solution vectors.


    DAFE: comments

    • DAFE allows up to (M − 1) transformed features to be linearly extracted (remember that M is the number of classes).

    • Operational issues

    – The eigenvalues of Sw^{-1}Sb can be computed as the roots of the characteristic polynomial, i.e.:

      $\left|S_w^{-1}S_b - \lambda I\right| = 0 \quad \text{or, equivalently:} \quad \left|S_b - \lambda S_w\right| = 0$

    – The second formulation is more convenient because it does not require any matrix inversion.

    – The characteristic equation provides at most (M − 1) nonzero roots λ1, λ2, …, λM−1 and at least (n − M + 1) zero solutions.

    – An eigenvector ei is computed from each resulting nonzero eigenvalue λi.

    – The optimal transformation matrix T* is obtained through a row juxtaposition of the resulting eigenvectors (a code sketch follows below).
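    A compact illustrative sketch of the multiclass DAFE computation described above (hypothetical names; for simplicity it obtains the eigenvectors of Sw^{-1}Sb directly with numpy.linalg.eig rather than through the characteristic polynomial):

```python
import numpy as np

def dafe(X, labels, m):
    """Multiclass DAFE/LDA sketch: rows of T* are eigenvectors of Sw^{-1} Sb
    associated with the largest eigenvalues (m <= M - 1)."""
    classes = np.unique(labels)
    mu = X.mean(axis=0)                                  # overall centroid
    n = X.shape[1]
    Sw, Sb = np.zeros((n, n)), np.zeros((n, n))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        d = (mu_c - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)                        # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][:m]             # keep the m largest eigenvalues
    return evecs[:, order].real.T                        # row juxtaposition of eigenvectors

# Transformed training set: Y = X @ dafe(X, labels, m).T
```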


    Principal component analysis

    • Principal component analysis (PCA, or Karhunen-Loeve transform, KL) is an unsupervised algorithm for feature extraction. In particular, PCA reduces the dimension of the feature space on the basis of a mean square error criterion.

    • Problem setting

    – Let a data set {x1, x2, …, xN} composed of N samples be given.

    – A coordinate system in the n-D feature space is determined by an orthonormal basis {e1, e2, …, en} and by an origin c.

    – In such a coordinate system each sample is expressed as:

      $\mathbf{x}_k = \mathbf{c} + \sum_{i=1}^{n} y_{ik}\,\mathbf{e}_i, \qquad k = 1, 2, \ldots, N$

    – To reduce the dimension of the feature space, one could keep only m components:

      $\mathbf{x}_k \approx \mathbf{c} + \sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i, \qquad k = 1, 2, \ldots, N$

      however, it is not obvious that the m components yik should be the projections of (xk − c) along ei.


    Geometric interpretation

    • Two-dimensional example 

    – Approximation of the samples in a two-dimensional feature space (plane) as the sum of a constant vector c and of the component along one unit vector e1.

      [Figure: samples in the (x1, x2) plane approximated by c plus their projection along e1.]


    PCA: mean square error

    • If the components of xk along (n − m) axes are discarded, an error is obviously introduced. PCA selects the coordinate system that minimizes the mean square error.

    – The adopted functional is:

      $\varepsilon = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k - \hat{\mathbf{x}}_k\right\|^2 = \frac{1}{N}\sum_{k=1}^{N}\left\|\mathbf{x}_k - \mathbf{c} - \sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i\right\|^2$

    – This functional has to be minimized with respect to all related variables, i.e., the origin c, the vectors ei, and the components yik, under the following orthonormality constraint:

      $\mathbf{e}_i^t\mathbf{e}_j = \delta_{ij}, \qquad i, j = 1, 2, \ldots, m$

    – Plugging this constraint into the expression of the functional yields:

      $\varepsilon = \frac{1}{N}\sum_{k=1}^{N}\left[\sum_{i=1}^{m} y_{ik}^2 - 2\sum_{i=1}^{m} y_{ik}\,\mathbf{e}_i^t(\mathbf{x}_k - \mathbf{c}) + \|\mathbf{x}_k - \mathbf{c}\|^2\right]$


    PCA: optimal components of the samples

    • Let us compute, first, the optimum components of the samples along the first m vectors of the basis {e1, e2, …, en} (unconstrained minimization).

    – The stationarity of the functional with respect to each component yik yields:

      $\frac{\partial\varepsilon}{\partial y_{ik}} = 0 \;\Rightarrow\; y_{ik} = \mathbf{e}_i^t(\mathbf{x}_k - \mathbf{c}) = \mathbf{e}_i^t\mathbf{x}_k - b_i, \qquad k = 1, 2, \ldots, N$

      where bi = ei^t c is the component of c along the i-th unit vector ei of the unknown orthonormal basis (i = 1, 2, …, m).

    – Plugging these optimal values into ε allows obtaining:

      $\varepsilon = \frac{1}{N}\sum_{k=1}^{N}\left\{\|\mathbf{x}_k - \mathbf{c}\|^2 - \sum_{i=1}^{m}\left[\mathbf{e}_i^t(\mathbf{x}_k - \mathbf{c})\right]^2\right\} = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k - \mathbf{c})\right]^2 = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left(\mathbf{e}_i^t\mathbf{x}_k - b_i\right)^2$


    PCA: optimal origin

    • ε now depends on the origin c only through the components bm+1, bm+2, …, bn of c along em+1, em+2, …, en.

    – The zero-gradient condition with respect to bi (i = m + 1, m + 2, …, n) yields:

      $\frac{\partial\varepsilon}{\partial b_i} = -\frac{2}{N}\sum_{k=1}^{N}\left(\mathbf{e}_i^t\mathbf{x}_k - b_i\right) = 0 \;\Rightarrow\; b_i = \frac{1}{N}\sum_{k=1}^{N}\mathbf{e}_i^t\mathbf{x}_k = \mathbf{e}_i^t\boldsymbol{\mu}, \quad \text{where } \boldsymbol{\mu} = \frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_k$ is the centroid of the data set.

    – Consequently:

      $\varepsilon = \frac{1}{N}\sum_{k=1}^{N}\sum_{i=m+1}^{n}\left[\mathbf{e}_i^t(\mathbf{x}_k - \boldsymbol{\mu})\right]^2 = \sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i, \quad \text{where } \Sigma = \frac{1}{N}\sum_{k=1}^{N}(\mathbf{x}_k - \boldsymbol{\mu})(\mathbf{x}_k - \boldsymbol{\mu})^t$ is the sample covariance of the data set.


    PCA: optimal orthonormal basis

    • The vectors ei (i = 1, 2, …, n) are supposed to be orthonormal, so their optimization is a constrained problem.

    – Optimum basis vectors ei (through Lagrange multipliers):

      $\frac{\partial}{\partial\mathbf{e}_i}\left[\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i - \lambda_i\left(\mathbf{e}_i^t\mathbf{e}_i - 1\right)\right] = 2\,(\Sigma - \lambda_i I)\,\mathbf{e}_i = \mathbf{0} \;\Rightarrow\; \Sigma\,\mathbf{e}_i = \lambda_i\,\mathbf{e}_i$

    – The sample covariance Σ is symmetric and positive semidefinite. Therefore, it has n real nonnegative eigenvalues λ1, λ2, …, λn with corresponding orthonormal eigenvectors e1, e2, …, en.

    – To establish which m eigenvectors should be preserved (and which (n − m) should be discarded), let us plug the obtained optimal values into the expression of the functional. This yields the following minimum mean square error:

      $\varepsilon^* = \sum_{i=m+1}^{n}\mathbf{e}_i^t\,\Sigma\,\mathbf{e}_i = \sum_{i=m+1}^{n}\lambda_i$


    PCA: feature reduction

    • Therefore, the minimum value of ε* is obtained if λm+1, λm+2, …, λn are the smallest eigenvalues, i.e., if the preserved unit vectors e1, e2, …, em correspond to the m largest eigenvalues λ1, λ2, …, λm.

    • Expression of the PCA transformation

    – If the n eigenvalues of Σ are ordered in decreasing order (i.e., λ1 ≥ λ2 ≥ … ≥ λn), the PCA transformation projects the samples (centered with respect to the centroid) along the axes e1, e2, …, em corresponding to the first m eigenvalues:

      $y_{ik} = \mathbf{e}_i^t(\mathbf{x}_k - \boldsymbol{\mu}), \qquad \mathbf{y}_k = T\,(\mathbf{x}_k - \boldsymbol{\mu}), \quad \text{with } T = \begin{bmatrix}\mathbf{e}_1^t\\ \mathbf{e}_2^t\\ \vdots\\ \mathbf{e}_m^t\end{bmatrix}$


    PCA: remarks

    • Operatively, PCA is applied as follows (a code sketch follows below):

    – Compute the centroid μ and the sample covariance Σ of the whole data set.

    – Compute the eigenvalues and the eigenvectors of Σ.

    – Order the eigenvalues in decreasing order.

    – Compute the matrix T through the row juxtaposition of the eigenvectors corresponding to the first m eigenvalues.

    • Remarks

    – Therefore, the PCA transformation is y = T(x − μ).

    – According to the expression of the minimum mean square error, the information loss due to feature reduction through PCA is often quantified through the following efficiency factor:

      $\eta = \frac{\sum_{i=1}^{m}\lambda_i}{\sum_{i=1}^{n}\lambda_i}$

      [Figure: plot of the efficiency factor η versus m, with the chosen value m* marked.]
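    The operational steps above translate into a short NumPy sketch (illustrative names; the covariance uses the 1/N normalization adopted in the slides):

```python
import numpy as np

def pca_fit(X, m):
    """PCA sketch: centroid, m x n transformation matrix T (rows = leading
    eigenvectors of the sample covariance) and efficiency factor eta."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]           # sample covariance with 1/N
    evals, evecs = np.linalg.eigh(Sigma)       # real eigenvalues, ascending order
    order = np.argsort(evals)[::-1]            # decreasing order
    evals, evecs = evals[order], evecs[:, order]
    T = evecs[:, :m].T                         # row juxtaposition of the first m eigenvectors
    eta = evals[:m].sum() / evals.sum()        # efficiency factor
    return mu, T, eta

def pca_transform(X, mu, T):
    """Apply y = T (x - mu) to every sample (rows of X)."""
    return (X - mu) @ T.T
```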


    PCA: interpretation of the principal components

    • The eigenvalue λi represents the sample variance along the axis ei (i = 1, 2, …, n).

    – The components along the axes e1, e2, …, en are named principal components. Therefore, one may say that "PCA preserves the first m principal components."

    – Geometrically, e1 is the direction along which the samples exhibit the maximum dispersion and en is the direction along which the sample dispersion is lowest.

    – Since the transformed features associated with maximum dispersion are chosen, PCA implicitly assumes that information is conveyed by the variance of the data (see the figure in the "Geometric interpretation" slide).


    PCA: remarks on the principal components

    • Choosing features related to maximum dispersion does not imply choosing features that discriminate the classes well.

    – In this 2D example, separation between the classes is poor with only the first PCA component y1, while considering both y1 and y2 yields better separation:

      [Figure: two classes ω1 and ω2 in the (x1, x2) plane with the PCA axes e1 and e2.]

    – Indeed, PCA does not use information about class membership of the samples. If a training set is available, it is convenient to use a supervised feature extraction method (e.g., LDA or more sophisticated approaches).


    Example (1)

    • Apply PCA to the following samples: (0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 1, 1).

    – The data set contains N = 8 samples with centroid μ = (1/2, 1/2, 1/2)^t.

    – Orthonormal eigenvectors of the sample covariance:

      $\mathbf{e}_1 = \frac{1}{\sqrt{3}}\begin{bmatrix}1\\1\\1\end{bmatrix}, \qquad \mathbf{e}_2 = \frac{1}{\sqrt{6}}\begin{bmatrix}-2\\1\\1\end{bmatrix}, \qquad \mathbf{e}_3 = \frac{1}{\sqrt{2}}\begin{bmatrix}0\\-1\\1\end{bmatrix}$

    – Transformation matrix for the extraction of two features:

      $T = \begin{bmatrix}\mathbf{e}_1^t\\ \mathbf{e}_2^t\end{bmatrix} = \begin{bmatrix}\tfrac{1}{\sqrt{3}} & \tfrac{1}{\sqrt{3}} & \tfrac{1}{\sqrt{3}}\\[2pt] -\tfrac{2}{\sqrt{6}} & \tfrac{1}{\sqrt{6}} & \tfrac{1}{\sqrt{6}}\end{bmatrix}$


    Example (2)

    • Compute the transformed samples:

    – Subtraction of the centroid from the samples:

      $\mathbf{x}_k - \boldsymbol{\mu}: \begin{bmatrix}-1/2\\-1/2\\-1/2\end{bmatrix}, \begin{bmatrix}1/2\\-1/2\\-1/2\end{bmatrix}, \begin{bmatrix}1/2\\-1/2\\1/2\end{bmatrix}, \begin{bmatrix}1/2\\1/2\\-1/2\end{bmatrix}, \begin{bmatrix}-1/2\\-1/2\\1/2\end{bmatrix}, \begin{bmatrix}-1/2\\1/2\\-1/2\end{bmatrix}, \begin{bmatrix}-1/2\\1/2\\1/2\end{bmatrix}, \begin{bmatrix}1/2\\1/2\\1/2\end{bmatrix}$

    – Transformed samples yk = T(xk − μ), k = 1, 2, …, 8:

      $\mathbf{y}_1 = \begin{bmatrix}-\tfrac{\sqrt{3}}{2}\\ 0\end{bmatrix}, \; \mathbf{y}_2 = \begin{bmatrix}-\tfrac{1}{2\sqrt{3}}\\ -\tfrac{2}{\sqrt{6}}\end{bmatrix}, \; \mathbf{y}_3 = \begin{bmatrix}\tfrac{1}{2\sqrt{3}}\\ -\tfrac{1}{\sqrt{6}}\end{bmatrix}, \; \mathbf{y}_4 = \begin{bmatrix}\tfrac{1}{2\sqrt{3}}\\ -\tfrac{1}{\sqrt{6}}\end{bmatrix}, \; \mathbf{y}_5 = \begin{bmatrix}-\tfrac{1}{2\sqrt{3}}\\ \tfrac{1}{\sqrt{6}}\end{bmatrix}, \; \mathbf{y}_6 = \begin{bmatrix}-\tfrac{1}{2\sqrt{3}}\\ \tfrac{1}{\sqrt{6}}\end{bmatrix}, \; \mathbf{y}_7 = \begin{bmatrix}\tfrac{1}{2\sqrt{3}}\\ \tfrac{2}{\sqrt{6}}\end{bmatrix}, \; \mathbf{y}_8 = \begin{bmatrix}\tfrac{\sqrt{3}}{2}\\ 0\end{bmatrix}$
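    A few lines of NumPy reproduce the projection step of this worked example, using the basis reconstructed above (an illustrative addition; note that for these particular samples the sample covariance is isotropic, so any orthonormal basis is a valid eigenvector basis and yields the same mean square error):

```python
import numpy as np

# Projection step of the worked example with the basis given above
X = np.array([[0, 0, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0],
              [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 1]], float)
mu = X.mean(axis=0)                            # centroid (1/2, 1/2, 1/2)
T = np.array([[1.0, 1.0, 1.0],
              [-2.0, 1.0, 1.0]]) / np.sqrt([[3.0], [6.0]])
Y = (X - mu) @ T.T                             # y_k = T (x_k - mu)
print(np.round(Y, 3))
```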


    Selection vs. extraction

    • Advantage of extraction methods

    – An extraction method projects the feature space onto a subspace such that the maximum information is preserved, and is consequently more flexible (indeed, selection is a particular case of extraction).

    • Advantage of selection methods

    – The features provided by a selection method are a subset of the original ones. Therefore, they maintain their physical meanings. This is relevant when information about the interpretation of the features is used in the classification process (e.g., knowledge-based methods).

    – On the contrary, an extraction method generates "virtual" features, which are defined as linear combinations of the "measured" original features and usually have well-defined mathematical meanings but not physical meanings.

    – Through selection, the discarded features are no longer necessary. With extraction, one usually needs all the original features (e.g., to compute the linear combinations).


    Bibliography

    • R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd edition, Wiley, New York, 2001.

    • K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, New York, 1990.

    • G. Hughes, "On the mean accuracy of statistical pattern recognizers", IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.

    • L. O. Jimenez and D. A. Landgrebe, "Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data", IEEE Transactions on Systems, Man and Cybernetics, Part C, vol. 28, no. 1, pp. 39-54, 1998.

    • H. C. Andrews, Introduction to Mathematical Techniques in Pattern Recognition, Wiley International, New York, 1972.

    • P. H. Swain and S. M. Davis, Remote Sensing: The Quantitative Approach, McGraw-Hill, New York, 1978.

    • J. A. Richards and X. Jia, Remote Sensing Digital Image Analysis, Springer-Verlag, Berlin, 1999.

    • S. B. Serpico and L. Bruzzone, "A new search algorithm for feature selection in hyperspectral remote sensing images", IEEE Transactions on Geoscience and Remote Sensing, vol. 39, pp. 1360-1367, 2001.

    • L. O. Jimenez and D. A. Landgrebe, "Hyperspectral data analysis and feature reduction via projection pursuit", IEEE Transactions on Geoscience and Remote Sensing, vol. 37, pp. 2653-2667, 1999.