
Lecture 7

• Principal Component Analysis

– Dimensionality reduction via feature extraction. Emphasis is on representing the original signal as accurately as possible in the lower dimensional space.

• Linear Discriminant Analysis

– Dimensionality reduction via feature extraction. Emphasis is to enhance the class-discrimination in the lower dimensional space.

Principal Components Analysis

• The curse of dimensionality

The curse of dimensionality

• The curse of dimensionality

– A term coined by Richard Bellman in 1961.

– Refers to the problems associated with multivariate data analysis as the dimensionality increases. We will illustrate these problems with a simple example.

Consider a 3-class pattern recognition problem

• A simple approach would be to

– Divide the feature space into uniform bins

– Compute the ratio of examples for each class at each bin and,

– For a new example, find its bin and choose the predominant class in that bin

• In our toy problem we decide to start with one single feature and divide the real line into 3 segments


• After doing this, we notice that there exists too much overlap among the classes, so we decide to incorporate a second feature to try and improve separability


We decide to preserve the granularity of each axis, which raises the number of bins from 3 (in 1D) to 3^2 = 9 (in 2D)

• Need to make a decision: maintain the density of examples per bin, or keep the number of examples as for the 1D case?

– Maintaining the density increases the number of examples from 9 (in 1D) to 27 (in 2D)

– Maintaining the number of examples results in a very sparse 2D scatter plot


[Figure: the toy problem binned in 1D (x_1), in 2D (x_1, x_2) with constant density and with constant number of examples, and in 3D (x_1, x_2, x_3)]

Moving to three features makes the problem worse:

• The number of bins grows to 3^3 = 27

• For the same density of examples the number of needed examples becomes 81

• For the same number of examples, well, the 3D scatter plot is almost empty
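As a small aside (not in the original notes), the arithmetic behind these counts can be reproduced in a few lines:

# Bin arithmetic for the toy problem: 3 bins per axis, and a density of
# 3 examples per bin kept constant as features are added.
bins_per_axis = 3
examples_per_bin = 3

for D in (1, 2, 3):
    n_bins = bins_per_axis ** D              # 3, 9, 27 bins
    n_examples = examples_per_bin * n_bins   # 9, 27, 81 examples for constant density
    print(f"D={D}: {n_bins} bins, {n_examples} examples to keep the density constant")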


The curse of dimensionality

• This approach of dividing the sample space into equally spaced bins was quite inefficient

– There are other approaches that are much less susceptible to the curse of

dimensionality, but the problem still exists.

• How do we beat the curse of dimensionality?

– By incorporating prior knowledge

– By providing increasing smoothness of the target function

– By reducing the dimensionality

• In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve

– In most cases, the additional information that is lost by discarding some features

is (more than) compensated by a more accurate mapping in the lower dimensional

space


[Figure: classifier performance vs. dimensionality for a fixed sample size]


The curse of dimensionality

There are many implications of the curse of dimensionality:

• Exponential growth in the number of examples required to maintain a given sampling density

– For a density of N examples/bin and D dimensions, the total number of examples is N^D

• Exponential growth in the complexity of the target function (a density estimate) with increasing dimensionality

– “A function defined in high-dimensional space is likely to be much more complex

than a function defined in a lower-dimensional space, and those complications are

harder to discern” - Friedman

This means that, in order to learn it well, a more complex target function requires denser sample points!

• What to do if it isn’t Gaussian?

– For one dimension a large number of density functions can be found in textbooks,

but for high-dimensions only the multivariate Gaussian density is available.

Moreover, for larger values of D the Gaussian density can only be handled in

a simplified form!

• Humans have an extraordinary capacity to discern patterns and clusters in 1, 2 and 3 dimensions, but these capabilities degrade drastically for 4 or higher dimensions.

Principal Components Analysis

• The curse of dimensionality

• Dimensionality reduction & Feature selection vs. feature extraction

Dimensionality reduction (1)

Two approaches to dimensionality reduction:

• Feature extraction: create a subset of new features by combinations of the existing features

• Feature selection: choose a subset of all the features (the more informative ones)

\underbrace{\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix} \longrightarrow \begin{pmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_M} \end{pmatrix}}_{\text{feature selection}}, \qquad \underbrace{\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix} \longrightarrow \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix} = f\!\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}}_{\text{feature extraction}}


The problem of feature extraction can be stated as

• Given a feature space x ∈ R^N, find a mapping

y = f(x) : R^N → R^M, with M < N,

such that the transformed feature vector y ∈ R^M preserves (most of) the information or structure in R^N.

• An optimal mapping y = f(x) will be one that results in no increase in the minimum probability of error.

– That is, a Bayes decision rule applied to the initial space R^N and to the reduced space R^M yields the same classification rate.

Dimensionality reduction (2)

Generally, the optimal mapping y = f(x) is a non-linear function.

• However, there is no systematic way to generate non-linear transforms

– The selection of a particular subset of transforms is problem dependent

• For this reason, feature extraction is commonly limited to linear transforms: y = Wx

– That is, y is a linear projection of x.

– When the mapping is a non-linear function, the reduced space is called a manifold.

\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix} \longrightarrow \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ w_{21} & w_{22} & \cdots & w_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MN} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}
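As a tiny illustration (not from the notes), the sketch below applies a made-up 2 × 3 projection matrix W to reduce an N = 3 vector to M = 2 dimensions:

import numpy as np

# A made-up (M x N) = (2 x 3) projection matrix W applied to one vector x.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.7, 0.7]])
x = np.array([2.0, 1.0, -1.0])

y = W @ x          # y = Wx, the reduced 2-dimensional representation
print(y)           # prints [2. 0.]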

Principal Components Analysis

• The curse of dimensionality

• Dimensionality reduction

• Feature selection vs. feature extraction

• Signal representation vs. signal classification

Signal Representation Vs Classification

• The selection of the feature extraction mapping y = f(x) is guided by an objective function that we seek to maximize (or minimize)

• Depending on the criteria used by the objective function, feature extraction techniques are grouped into two categories:

– Signal representation: The goal of the feature extraction mapping is to represent the samples accurately in a lower-dimensional space.

– Classification: The goal of the feature extraction mapping is to enhance the class-discriminatory information in the lower-dimensional space.


[Figure: a two-class scatter plot in (Feature 1, Feature 2), showing the best direction for signal representation and the best direction for classification]


• Within the realm of linear feature extraction, two techniques are commonly used

– Principal Components Analysis (PCA) - uses a signal representation criterion

– Linear Discriminant Analysis (LDA) - uses a signal classificationcriterion

Principal Components Analysis

• The curse of dimensionality

• Dimensionality reduction

• Feature selection vs. feature extraction

• Signal representation vs. signal classification

• Principal Components Analysis

Intuitive Motivation

Want to encode as accurately as possible the position of the m points in this cluster. Can do so exactly with their x, y-coordinate locations, of which there are 2m. However, say I only have bandwidth to record m + 4 numbers. Intuitively, what should these numbers represent?

Intuitive Motivation

Devote 2 to the center of mass of the points.


Intuitive Motivation

Use 2 numbers to define a direction which corresponds to the direction in which there is most variation.

Intuitive Motivation

Let the other m numbers represent the distance of the point projected onto the line from the centre point.

Principal Components Analysis, PCA

The objective of PCA is to perform dimensionality reduction while preserving as much of the randomness (variance) in the high-dimensional space as possible.

• Let x be an N-dimensional random vector, represented as a linear combination of orthonormal basis vectors (φ_1, φ_2, · · · , φ_N) as

x = \sum_{i=1}^{N} y_i \phi_i, \qquad \text{where } \phi_i^T \phi_j = \delta_{ij}

• Suppose we choose to represent x with only M (M < N) of the basis vectors. Do this by replacing the components (y_{M+1}, · · · , y_N)^T with some pre-selected constants b_i:

x(M) = \sum_{i=1}^{M} y_i \phi_i + \sum_{i=M+1}^{N} b_i \phi_i

• The representation error is then

\Delta x(M) = x - x(M) = \sum_{i=1}^{N} y_i \phi_i - \left( \sum_{i=1}^{M} y_i \phi_i + \sum_{i=M+1}^{N} b_i \phi_i \right) = \sum_{i=M+1}^{N} (y_i - b_i)\, \phi_i

• We can measure this representation error by the mean-squared magnitude of ∆x.

• Our goal is to find the basis vectors φ_i and constants b_i that minimize this mean-square error:

\varepsilon^2(M) = E\left[ |\Delta x(M)|^2 \right] = E\left[ \sum_{i=M+1}^{N} \sum_{j=M+1}^{N} (y_i - b_i)(y_j - b_j)\, \phi_i^T \phi_j \right] = \sum_{i=M+1}^{N} E\left[ (y_i - b_i)^2 \right], \quad \text{as the } \phi_i \text{ are orthonormal}


PCA (2)

• The optimal values of b_i can be found by computing the partial derivative of the objective function and setting it to zero:

\frac{\partial}{\partial b_i} E\left[ (y_i - b_i)^2 \right] = -2\left( E[y_i] - b_i \right) = 0 \implies b_i = E[y_i]

– Therefore, we will replace the discarded dimensions y_i by their expected values.

• The mean-square error can then be written as

\varepsilon^2(M) = \sum_{i=M+1}^{N} E\left[ (y_i - E[y_i])^2 \right] = \sum_{i=M+1}^{N} E\left[ \left( \phi_i^T x - E[\phi_i^T x] \right)\left( \phi_i^T x - E[\phi_i^T x] \right)^T \right] = \sum_{i=M+1}^{N} \phi_i^T E\left[ (x - E[x])(x - E[x])^T \right] \phi_i = \sum_{i=M+1}^{N} \phi_i^T \Sigma_x \phi_i

• We seek the solution that minimizes this expression subject to the orthonormality constraint, which we incorporate into the expression using a set of Lagrange multipliers λ_i:

\varepsilon^2(M) = \sum_{i=M+1}^{N} \phi_i^T \Sigma_x \phi_i + \sum_{i=M+1}^{N} \lambda_i \left( 1 - \phi_i^T \phi_i \right)

• Computing the partial derivative with respect to the basis vectors,

\frac{\partial}{\partial \phi_i} \varepsilon^2(M) = \frac{\partial}{\partial \phi_i} \left[ \sum_{i=M+1}^{N} \phi_i^T \Sigma_x \phi_i + \sum_{i=M+1}^{N} \lambda_i \left( 1 - \phi_i^T \phi_i \right) \right] = 2\left( \Sigma_x \phi_i - \lambda_i \phi_i \right) = 0, \quad \text{using } \frac{\partial x^T A x}{\partial x} = (A + A^T)\, x

\implies \Sigma_x \phi_i = \lambda_i \phi_i

So the φ_i and λ_i are the eigenvectors and eigenvalues of the covariance matrix Σ_x.

PCA (3)

• We can then express the sum-square error as

\varepsilon^2(M) = \sum_{i=M+1}^{N} \phi_i^T \Sigma_x \phi_i = \sum_{i=M+1}^{N} \phi_i^T \lambda_i \phi_i = \sum_{i=M+1}^{N} \lambda_i

• To minimize this measure, choose the λ_i to be the smallest eigenvalues.

Therefore, to represent x with minimum sum-square error, choose the eigenvectors φ_i corresponding to the largest eigenvalues λ_i.

PCA Dimensionality Reduction

The optimal approximation of a random vector x ∈ R^N by a linear combination of M (M < N) independent vectors is obtained by projecting the random vector x onto the eigenvectors φ_i corresponding to the largest eigenvalues λ_i of the covariance matrix Σ_x.

Basic PCA - Implementation Guide

• Given m data points x_i, each of dimension N.

• Compute the mean µ = (1/m) Σ_{i=1}^{m} x_i and subtract it from each data point: x_i^c = x_i − µ.

• Compute the data matrix X where each column is a centred data point x_i^c.

• Compute the covariance matrix Σ = (1/m) X X^T.

• Find the eigenvectors and eigenvalues of Σ.

• The principal components are the k eigenvectors with the highest eigenvalues.
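A minimal NumPy sketch of these steps; the function name basic_pca and the random demo data are illustrative choices, not part of the notes:

import numpy as np

def basic_pca(points, k):
    """PCA via the covariance matrix, following the steps above.
    `points` is an (m, N) array with one data point per row."""
    m, N = points.shape
    mu = points.mean(axis=0)                     # mean of the data
    Xc = (points - mu).T                         # data matrix: centred points as columns, (N, m)
    Sigma = (Xc @ Xc.T) / m                      # covariance, Sigma = (1/m) X X^T
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh because Sigma is symmetric
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues in decreasing order
    return mu, eigvals[order[:k]], eigvecs[:, order[:k]]

# Example usage: keep the two leading principal components of some random data.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
mu, top_vals, top_vecs = basic_pca(data, k=2)
projections = (data - mu) @ top_vecs             # coordinates in the reduced space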


Remember SVD

Any arbitrary m × n matrix X can be converted to the product of an orthogonal matrix, a diagonal matrix and another orthogonal matrix via the singular value decomposition:

X = U S V^T

where

• S is a diagonal (m × n) matrix containing the singular values

• U is a square (m × m) orthogonal matrix

• V is a square (n × n) orthogonal matrix

SVD PCA - Implementation Guide

• Given m data points x_i, each of dimension N.

• Compute the mean µ = (1/m) Σ_{i=1}^{m} x_i and subtract it from each data point: x_i^c = x_i − µ.

• Compute the data matrix X where each column is a centred data point x_i^c.

• Let Y = (1/√m) X^T and perform the SVD Y = W S V^T.

• The principal components are the k right singular vectors with the highest singular values (rows of V^T).
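The same components can be obtained through the SVD route; the sketch below mirrors the steps above (the function name svd_pca is again an illustrative choice):

import numpy as np

def svd_pca(points, k):
    """PCA via the SVD of Y = (1/sqrt(m)) X^T, following the steps above."""
    m, N = points.shape
    mu = points.mean(axis=0)
    X = (points - mu).T                          # data matrix: centred points as columns, (N, m)
    Y = X.T / np.sqrt(m)                         # Y = (1/sqrt(m)) X^T, shape (m, N)
    W, S, Vt = np.linalg.svd(Y, full_matrices=False)
    # Rows of Vt are the principal components; S**2 are the eigenvalues of Sigma.
    return mu, S[:k] ** 2, Vt[:k]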

PCA (4)

NOTES

• Since PCA uses the eigenvectors of the covariance matrix Σ_x, it is able to find the independent axes of the data under the unimodal Gaussian assumption

– For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the

axes

• The main limitation of PCA is that it does not consider class separability since it does not take into account the class label of the feature vector

– PCA simply performs a coordinate rotation that aligns the transformed axes with

the directions of maximum variance.

– There is no guarantee that the directions of maximum variance will

contain good features for discrimination.

Historical Remarks

• Principal Components Analysis is the oldest technique in multivariate analysis

• PCA is also known as the Karhunen-Loeve transform (communication theory)

• PCA was first introduced by Pearson in 1901, and it experienced several modifications until it was generalized by Loeve in 1963


PCA example (1)


We have a 3-dimensional Gaussian distribution with the following parameters:

\mu = (0, 5, 2)^T, \qquad \Sigma = \begin{pmatrix} 25 & -1 & 7 \\ -1 & 4 & -4 \\ 7 & -4 & 10 \end{pmatrix}


The three pairs of principal component projections show that the first projection has the largest variance, followed by the second. The PCA projections de-correlate the axes (we knew this already from Lecture 3).

PCA example (2)

Compute the principal components for the following two-dimensional dataset:

x = {(x1, x2)} = {(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)}

Solution: First plot the data to get an idea of which solution we should expect.

[Figure: scatter plot of the dataset]

PCA example (2) ctd

Solution (by hand):

• The (biased) covariance estimate of the data is:

\Sigma_x = \begin{pmatrix} 6.25 & 4.25 \\ 4.25 & 3.5 \end{pmatrix}

• The eigenvalues are the zeros of the characteristic equation:

\Sigma_x v = \lambda v \implies |\Sigma_x - \lambda I| = 0 \implies \lambda_1 = 9.34, \ \lambda_2 = 0.41

• The eigenvectors are the solutions of the system:

\begin{pmatrix} 6.25 & 4.25 \\ 4.25 & 3.5 \end{pmatrix} v_1 = \lambda_1 v_1 \implies v_1 = \begin{pmatrix} 0.81 \\ 0.59 \end{pmatrix}

What is v_2?
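A quick numerical check of the hand computation, assuming the biased covariance estimate (division by m) as above:

import numpy as np

# Check of the hand computation above (biased covariance, i.e. dividing by m).
x = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)
Sigma = np.cov(x, rowvar=False, bias=True)       # [[6.25, 4.25], [4.25, 3.5]]
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(eigvals)                                   # ~[0.41, 9.34]
print(eigvecs[:, -1])                            # first principal component, ~[0.81, 0.59] up to sign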


Eigenfaces, PCA example

Let X = {x_1, x_2, · · · , x_m} be a collection of feature vectors. Each feature vector x_i ∈ [0, 255]^{N^2} corresponds to the pixel values of a visual image (N × N) of a face.

Eigenfaces, PCA example

The mean face is µ = (1/m) Σ_{i=1}^{m} x_i. Let x_i^* = x_i − µ. The eigenfaces are the PCA basis vectors, found as the eigenvectors of

\Sigma_x = \frac{1}{m} \sum_{i=1}^{m} x_i^* (x_i^*)^T = A A^T

Eigenfaces: Turk & Pentland (1992). [Figure: the mean face and the eigenfaces]

Note: Σ_x may be huge (N^2 × N^2). We need to use the trick of finding the eigenvectors of A^T A (of size m × m). If (A^T A) v_i = λ_i v_i, then

\Sigma_x A v_i = (A A^T) A v_i = A (A^T A) v_i = \lambda_i A v_i \implies A v_i \ \text{is an eigenvector of } \Sigma_x
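A rough NumPy sketch of this trick; the sizes (m = 40 images of 64 × 64 pixels) and the random pixel values are placeholders, not a real face dataset:

import numpy as np

# Sketch of the A^T A trick: with m images of N x N pixels, diagonalise the small
# (m x m) matrix A^T A instead of the huge (N^2 x N^2) covariance A A^T.
m, n_pixels, k = 40, 64 * 64, 10
rng = np.random.default_rng(0)
faces = rng.random((m, n_pixels)) * 255          # stand-ins for the image vectors x_i

mu = faces.mean(axis=0)                          # the "mean face"
A = (faces - mu).T / np.sqrt(m)                  # columns are centred faces, so A A^T = Sigma_x
small = A.T @ A                                  # (m x m)
vals, vecs = np.linalg.eigh(small)               # eigenvectors v_i of A^T A
eigenfaces = A @ vecs[:, ::-1][:, :k]            # A v_i are eigenvectors of Sigma_x
eigenfaces /= np.linalg.norm(eigenfaces, axis=0) # normalise each eigenface column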

Eigenfaces, PCA example

Standard Eigenfaces (← increasing eigenvalue)

Matlab example

• Effect of subtracting the mean. [Figures: eigenfaces computed without the mean subtracted vs. with the mean subtracted]

Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes


LDA, two-classes(1)

The objective of LDA is to perform dimensionality reduction while preserving as much of

the class discriminatory information as possible

• Assume we have a set of D-dimensional samples {x_1, x_2, · · · , x_N}, N_1 of which belong to class ω_1, and N_2 to class ω_2. We seek to obtain a scalar y by projecting the samples x onto a line:

y = w^T x

• Of all the possible lines we would like to select the one that maximizes the separability

of the scalars

This is illustrated for the two-dimensional case in the following figures

[Figure: two-dimensional, two-class data projected onto two different lines; one direction mixes the classes while the other separates them well]

LDA, two-classes(2)

In order to find a good projection vector, we need to define a measure of separation

between the projections

• The mean vector of each class in the x and y feature spaces is

\mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x, \qquad \tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y = \frac{1}{N_i} \sum_{x \in \omega_i} w^T x = w^T \mu_i

• We could then choose the distance between the projected means as our objective function:

J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T (\mu_1 - \mu_2)|

• However, the distance between the projected means is not a very good measure since

it does not take into account the standard deviation within the classes.

Why such a measure may be too simplistic: [Figure: of two candidate projection axes, the axis with the larger distance between the projected means is not the one that yields better class separability]


LDA, two-classes(3)

The solution proposed by Fisher is to maximize a function that represents the difference

between the means, normalized by a measure of the within-class scatter.

• For each class we define the scatter, an equivalent of the variance, as

\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2

The quantity (\tilde{s}_1^2 + \tilde{s}_2^2) is called the within-class scatter of the projected examples.

• The Fisher linear discriminant is defined as the linear projection w^T x that maximizes the criterion function

J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}

• Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible.


LDA, two-classes(4)

• To find the optimum projection w∗, we need to express J(w) as an explicit function

of w.

• We define measures of the scatter in the multivariate feature space x, the scatter matrices:

S_W = S_1 + S_2, \qquad \text{where } S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T

The matrix S_W is called the within-class scatter matrix.

• The scatter of the projection y can then be expressed as a function of the scatter matrices in feature space x:

\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w

\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w

• Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:

(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = w^T \underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B}\, w = w^T S_B w

The matrix S_B is called the between-class scatter. Note that, since S_B is the outer product of two vectors, its rank is at most one.

• We can finally express the Fisher criterion in terms of S_W and S_B as

J(w) = \frac{w^T S_B w}{w^T S_W w}


LDA, two-classes(5)

• To find the maximum of J(w) we differentiate and set the derivative to zero:

\frac{d}{dw} J(w) = \frac{d}{dw} \left[ \frac{w^T S_B w}{w^T S_W w} \right] = 0

\implies (w^T S_W w) \frac{d}{dw}\left[ w^T S_B w \right] - (w^T S_B w) \frac{d}{dw}\left[ w^T S_W w \right] = 0

\implies (w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w = 0

\implies (w^T S_W w)\, S_B w = (w^T S_B w)\, S_W w

• From the definition of S_B we see that S_B w ∝ (µ_1 − µ_2).

• We do not care about the magnitude of w, only its direction.

• Then, dropping the scalar factors and multiplying both sides by S_W^{-1}, we get

w \propto S_W^{-1} (\mu_1 - \mu_2)

• In summary, we get the Fisher linear discriminant (1936):

w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1} (\mu_1 - \mu_2)

Although, it should be noted, it is not itself a discriminant but rather a specific choice of direction for projecting the data down to one dimension.


LDA Classifier

Use w to construct a classifier. Project the training data onto the line w. We must find a threshold θ such that

w^T x ≥ θ ⟹ x belongs to class ω_1

w^T x < θ ⟹ x belongs to class ω_2

How should we learn θ?

• Project the training data from both classes onto w.

• Search for the θ that minimizes the number of misclassifications, as sketched below.
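A minimal sketch of this procedure; the function name fisher_lda is mine, and the scatter matrices are left unscaled, which only rescales w by a constant factor:

import numpy as np

def fisher_lda(X1, X2):
    """Two-class Fisher LDA: w = S_W^{-1}(mu1 - mu2), plus a brute-force threshold
    chosen to minimise the number of misclassified training points."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)               # class scatter matrices (unscaled)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    w = np.linalg.solve(S1 + S2, mu1 - mu2)      # solves S_W w = (mu1 - mu2)

    # Project both classes onto w; class-1 projections tend to be larger because
    # w^T mu1 > w^T mu2. Try midpoints between consecutive projected values.
    p1, p2 = X1 @ w, X2 @ w
    cand = np.sort(np.concatenate([p1, p2]))
    cand = (cand[:-1] + cand[1:]) / 2
    errors = [np.sum(p1 < t) + np.sum(p2 >= t) for t in cand]
    theta = cand[int(np.argmin(errors))]
    return w, theta                              # classify x as class 1 if w @ x >= theta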


LDA, example

Compute the Linear Discriminant projection for the following two-dimensional dataset

X1 = {x1, · · · , x5} = {(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)}

X2 = {x1, · · · , x5} = {(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)}


Solution (by hand):

• The class statistics are:

S_1 = \begin{pmatrix} 0.80 & -0.40 \\ -0.40 & 2.64 \end{pmatrix}, \quad S_2 = \begin{pmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{pmatrix}, \quad \mu_1 = (3.0, 3.6)^T, \quad \mu_2 = (8.4, 7.6)^T

• The within- and between-class scatter are:

S_B = \begin{pmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{pmatrix}, \qquad S_W = \begin{pmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{pmatrix}

• The LDA projection is then obtained as the solution of the generalized eigenvalue problem:

S_W^{-1} S_B v = \lambda v \implies |S_W^{-1} S_B - \lambda I| = 0 \implies \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \implies \lambda = 15.65

Then, plugging in to find the eigenvector, which is the optimal projection direction:

\begin{pmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{pmatrix} v = 15.65\, v \implies v = \begin{pmatrix} 0.91 \\ 0.39 \end{pmatrix}

• Or directly by

w^* = S_W^{-1} (\mu_1 - \mu_2) = (-0.91, -0.39)^T
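A numerical check of this example, using the biased per-class covariances exactly as in the class statistics above:

import numpy as np

# Numerical check of the hand computation above (biased per-class covariances).
X1 = np.array([(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)], dtype=float)
X2 = np.array([(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)], dtype=float)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)              # (3.0, 3.6) and (8.4, 7.6)

SW = np.cov(X1, rowvar=False, bias=True) + np.cov(X2, rowvar=False, bias=True)
SB = np.outer(mu1 - mu2, mu1 - mu2)                      # [[29.16, 21.6], [21.6, 16.0]]

vals, vecs = np.linalg.eig(np.linalg.inv(SW) @ SB)
print(vals.real.max())                  # ~15.65, the non-zero generalized eigenvalue
w = np.linalg.solve(SW, mu1 - mu2)
print(w / np.linalg.norm(w))            # ~[-0.92, -0.39]; the notes round to (-0.91, -0.39)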

Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes

• Linear Discriminant Analysis, C classes


LDA, C-classes

Fisher’s LDA generalizes very gracefully for C-class problems. Instead of one projection

y, we now seek (C − 1) projections (y1, y2, · · · , yC−1) by means of (C − 1)

projection vectors wi, which can be arranged by columns into a projection matrix

W = [w1w2 · · ·wC−1].

Derivation

• The generalization of the within-class scatter is

S_W = \sum_{i=1}^{C} S_i, \qquad \text{where } S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T \ \text{ and } \ \mu_i = \frac{1}{N_i} \sum_{x \in \omega_i} x

• The generalization of the between-class scatter is

S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T, \qquad \text{where } \mu = \frac{1}{N} \sum_{\forall x} x = \frac{1}{N} \sum_{i=1}^{C} N_i \mu_i

The matrix S_T = S_B + S_W is called the total scatter matrix.

[Figure: three classes ω_1, ω_2, ω_3 in the (x_1, x_2) plane, with their within-class scatters S_{W1}, S_{W2}, S_{W3} and between-class scatters S_{B1}, S_{B2}, S_{B3}]

• Similarly, we define the mean vectors and scatter matrices for the projected samples as

\tilde{\mu}_i = \frac{1}{N_i} \sum_{y \in \omega_i} y, \qquad \tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T

\tilde{\mu} = \frac{1}{N} \sum_{\forall y} y, \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T

• From our derivation for the two-class problem, we can write

\tilde{S}_W = W^T S_W W, \qquad \tilde{S}_B = W^T S_B W

• Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has C − 1 dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:

J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}

• We seek the projection matrix W^* that maximizes this ratio.

LDA, C-classes (2)

It can be shown that the optimal projection matrix W^* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:

W^* = [w_1^* \ w_2^* \cdots w_{C-1}^*] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \implies (S_B - \lambda_i S_W)\, w_i^* = 0

• S_B is the sum of C matrices of rank one or less, and the mean vectors are constrained by

\mu = \frac{1}{C} \sum_{i=1}^{C} \mu_i

Therefore, S_B will be of rank ≤ (C − 1), which implies that only (C − 1) of the eigenvalues λ_i will be non-zero.


• The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of S_W^{-1} S_B.

• LDA can be derived as the Maximum Likelihood method for the case of normal class

conditional densities with equal covariance matrices.
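A compact sketch of the C-class recipe (class scatter matrices, then the leading eigenvectors of S_W^{-1} S_B); the function name lda_projections is my own:

import numpy as np

def lda_projections(X, y, n_components=None):
    """Multi-class LDA sketch: return the eigenvectors of S_W^{-1} S_B with the
    largest eigenvalues (at most C - 1 of them carry discriminative information)."""
    classes = np.unique(y)
    N = X.shape[1]
    mu = X.mean(axis=0)                          # overall mean
    SW = np.zeros((N, N))
    SB = np.zeros((N, N))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)                # within-class scatter
        SB += len(Xc) * np.outer(mu_c - mu, mu_c - mu)   # between-class scatter
    vals, vecs = np.linalg.eig(np.linalg.solve(SW, SB))  # eigenvectors of S_W^{-1} S_B
    order = np.argsort(vals.real)[::-1]
    k = n_components if n_components is not None else len(classes) - 1
    return vecs[:, order[:k]].real               # projection matrix W, one w_i per column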

Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes

• Linear Discriminant Analysis, C classes

• LDA Vs. PCA example

Coffee discrimination with gas sensor array

• Five types of coffee beans were presented to an array of chemical gas sensors.

• For each coffee type, 45 sniffs were performed and the response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector.

[Figures: raw gas-sensor responses for the five coffees (Sulawesy, Kenya, Arabian, Sumatra, Colombia), and 3D scatter plots of the samples along the first three PCA axes and the first three LDA axes]

The following figures show the performance of PCA and LDA on this odor recognition

problem

Results:

• From the 3D scatter plots it is clear that LDA outperforms PCA in terms of class

discrimination

• This is one example where the discriminatory information is not aligned with the

direction of maximum variance



Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes

• Linear Discriminant Analysis, C classes

• LDA Vs. PCA example

• Limitations of LDA

Limitations of LDA

• LDA produces at most C − 1 feature projections

– If the classification error estimates establish that more features are needed, some

other method must be employed to provide those additional features

• LDA is a parametric method since it assumes unimodal Gaussian likelihoods

– If the distributions are significantly non-Gaussian, the LDA projections will not

be able to preserve any complex structure of the data, which may be needed for

classification.


• LDA will fail when the discriminatory information is not in the mean but rather in

the variance of the data.


Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes

• Linear Discriminant Analysis, C classes

• LDA Vs. PCA example

• Limitations of LDA

• Variants of LDA


Variants of LDA

• Non-parametric LDA (Fukunaga)

– NPLDA removes the unimodal Gaussian assumption by computing the between-

class scatter matrix SB using local information and the K Nearest Neighbors

rule.

∗ The matrix SB is full-rank, allowing the extraction of more than (C − 1)

features

∗ The projections are able to preserve the structure of the data more closely

• Orthonormal LDA (Okada and Tomita)

– OLDA computes projections that maximize the Fisher criterion and, at the same

time, are pair-wise orthonormal

∗ The method used in OLDA combines the eigenvalue solution of S_W^{-1} S_B and the Gram-Schmidt orthonormalization procedure.

∗ OLDA sequentially finds axes that maximize the Fisher criterion in the subspace

orthogonal to all features already extracted

∗ OLDA is also capable of finding more than (C − 1) features

• Generalized LDA (Lowe)

– GLDA generalizes the Fisher criterion by incorporating a cost function similar to

the one used to compute the Bayes Risk

∗ The effect of this generalized criterion is an LDA projection with a structure

that is biased by the cost function

∗ Classes with a higher cost Cij will be placed further apart in the low-dimensional

projection

• Multilayer Perceptrons (Webb and Lowe)

– It has been shown that the hidden layers of multi-layer perceptrons (MLP) perform

non-linear discriminant analysis by maximizing tr[SBST ], where the scatter

matrices are measured at the output of the last hidden layer.

Linear Discriminant Analysis

• Linear Discriminant Analysis, two classes

• Linear Discriminant Analysis, C classes

• LDA Vs. PCA example

• Limitations of LDA

• Variants of LDA

• Other dimensionality reduction methods

Other dimensionality reduction methods

Exploratory Projection Pursuit (Friedman and Tukey)

• EPP seeks an M-dimensional (M = 2, 3 typically) linear projection of the data that maximizes a measure of interestingness.

• Interestingness is measured as departure from multivariate normality.

– This measure is not the variance and is commonly scale-free. In most proposals

it is also affine invariant, so it does not depend on correlations between features.

[Ripley, 1996]

• In other words, EPP seeks projections that separate clusters as much as possible and keep these clusters compact, a criterion similar to Fisher's, but EPP does NOT use class labels.

• Once an interesting projection is found, it is important to remove the structure it

reveals to allow other interesting views to be found more easily.


[Figure: an 'interesting' projection that separates the clusters vs. an 'uninteresting' one that does not]

Other dimensionality reduction methods

Sammon’s Non-linear Mapping (Sammon)

• This method seeks a mapping onto an M-dimensional space that preserves the inter-point distances of the original N-dimensional space.

– This is accomplished by minimizing the following objective function:

\mathrm{Cost}(d, d') = \sum_{i \neq j} \frac{\left( d(P_i, P_j) - d(P_i', P_j') \right)^2}{d(P_i, P_j)}

∗ The original method did not obtain an explicit mapping but only a lookup table

for the elements in the training set

∗ Recent implementations using artificial neural networks (MLPs and RBFs) do

provide an explicit mapping for test data and also consider cost functions

(Neuroscale)

∗ Sammon’s mapping is closely related to Multi-Dimensional Scaling (MDS), a

family of multivariate statistical methods commonly used in the social sciences
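A rough gradient-descent sketch of minimizing this cost (not Sammon's original update rule; the step size and iteration count below are arbitrary choices):

import numpy as np

def sammon(X, n_components=2, n_iter=500, lr=0.1, eps=1e-9):
    """Rough sketch: move low-dimensional points Y by gradient descent so that
    their pairwise distances match those of the original points X."""
    rng = np.random.default_rng(0)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1) + eps   # original distances d(Pi, Pj)
    Y = rng.normal(scale=1e-2, size=(X.shape[0], n_components))  # random initial layout
    for _ in range(n_iter):
        d = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1) + eps  # mapped distances d(P'i, P'j)
        # gradient of sum_{i<j} (D_ij - d_ij)^2 / D_ij with respect to the rows of Y
        coeff = -2.0 * (D - d) / (D * d)
        np.fill_diagonal(coeff, 0.0)
        grad = (coeff[:, :, None] * (Y[:, None] - Y[None, :])).sum(axis=1)
        Y -= lr * grad
    return Y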
