L10: Linear discriminant analysis
• Linear discriminant analysis, two classes
• Linear discriminant analysis, C classes
• LDA vs. PCA
• Limitations of LDA
• Variants of LDA
• Other dimensionality reduction methods
Linear discriminant analysis, two classes
• Objective
– LDA seeks to reduce dimensionality while preserving as much of the class discriminatory information as possible
– Assume we have a set of $D$-dimensional samples $x^{(1)}, x^{(2)}, \dots, x^{(N)}$, $N_1$ of which belong to class $\omega_1$ and $N_2$ to class $\omega_2$
– We seek to obtain a scalar $y$ by projecting the samples $x$ onto a line

$$y = w^T x$$

– Of all the possible lines, we would like to select the one that maximizes the separability of the scalars
[Figure: two-class samples in the $(x_1, x_2)$ plane projected onto different candidate lines.]
– In order to find a good projection vector, we need to define a measure of separation
– The mean vector of each class in $x$-space and $y$-space is

$$\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x \qquad \text{and} \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y = \frac{1}{N_i}\sum_{x\in\omega_i} w^T x = w^T\mu_i$$

– We could then choose the distance between the projected means as our objective function

$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$$

• However, the distance between the projected means is not a good measure since it does not account for the standard deviation within the classes
[Figure: two candidate projection axes for the same two-class data in the $(x_1, x_2)$ plane; one axis has a larger distance between the projected means, while the other yields better class separability.]
• Fisher's solution
– Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class scatter
– For each class we define the scatter, an equivalent of the variance, as

$$\tilde{s}_i^2 = \sum_{y\in\omega_i}(y - \tilde{\mu}_i)^2$$

• where the quantity $\tilde{s}_1^2 + \tilde{s}_2^2$ is called the within-class scatter of the projected examples
– The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function

$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
– Therefore, we are looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible
[Figure: the Fisher projection for the two-class data ($\omega_1$, $\omega_2$) in the $(x_1, x_2)$ plane.]
• To find the optimum $w^*$, we must express $J(w)$ as a function of $w$
– First, we define a measure of the scatter in feature space $x$

$$S_i = \sum_{x\in\omega_i}(x - \mu_i)(x - \mu_i)^T$$

$$S_1 + S_2 = S_W$$

• where $S_W$ is called the within-class scatter matrix
– The scatter of the projection $y$ can then be expressed as a function of the scatter matrix in feature space $x$

$$\tilde{s}_i^2 = \sum_{y\in\omega_i}(y - \tilde{\mu}_i)^2 = \sum_{x\in\omega_i}(w^T x - w^T\mu_i)^2 = \sum_{x\in\omega_i} w^T(x - \mu_i)(x - \mu_i)^T w = w^T S_i w$$

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$

– Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space

$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T\mu_1 - w^T\mu_2)^2 = w^T\underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B}\, w = w^T S_B w$$

• The matrix $S_B$ is called the between-class scatter. Note that, since $S_B$ is the outer product of two vectors, its rank is at most one
– We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
– To find the maximum of $J(w)$, we differentiate with respect to $w$ and equate to zero

$$\frac{d}{dw}J(w) = \frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0$$

$$\Rightarrow\; (w^T S_W w)\frac{d}{dw}\left(w^T S_B w\right) - (w^T S_B w)\frac{d}{dw}\left(w^T S_W w\right) = 0$$

$$\Rightarrow\; (w^T S_W w)\, 2 S_B w - (w^T S_B w)\, 2 S_W w = 0$$

– Dividing by $w^T S_W w$

$$\frac{w^T S_W w}{w^T S_W w} S_B w - \frac{w^T S_B w}{w^T S_W w} S_W w = 0 \;\Rightarrow\; S_B w - J\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w - J\, w = 0$$

– Solving the generalized eigenvalue problem ($S_W^{-1} S_B w = J w$) yields

$$w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1}(\mu_1 - \mu_2)$$
– This is known as Fisher's linear discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension
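The closed-form solution above translates directly into a few lines of NumPy. The sketch below is illustrative and not part of the original lecture; the function name fisher_lda_direction is an assumption.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher direction w* = Sw^{-1}(mu1 - mu2) for two classes.

    X1, X2: arrays of shape (N1, D) and (N2, D), one sample per row.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    # Closed-form Fisher direction (solve a linear system instead of inverting Sw)
    w = np.linalg.solve(Sw, mu1 - mu2)
    return w / np.linalg.norm(w)  # normalize; only the direction matters
```

The 1-D projections are then obtained as `y = X @ w`. Note that rescaling the scatter matrices (e.g., dividing each $S_i$ by $N_i$) does not change the resulting direction.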
Example
• Compute the LDA projection for the following 2D dataset

$$\omega_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}$$
$$\omega_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$$

• SOLUTION (by hand)
– The class statistics are (each class scatter matrix is divided by $N_i = 5$, i.e., these are covariance matrices; this scaling does not change the LDA direction)

$$S_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix} \qquad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$

$$\mu_1 = [3.0 \;\; 3.6]^T \qquad \mu_2 = [8.4 \;\; 7.6]^T$$

– The within- and between-class scatter are

$$S_B = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix} \qquad S_W = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}$$

– The LDA projection is then obtained as the solution of the generalized eigenvalue problem

$$S_W^{-1} S_B v = \lambda v \;\Rightarrow\; \left|S_W^{-1} S_B - \lambda I\right| = \begin{vmatrix} 11.89-\lambda & 8.81 \\ 5.08 & 3.76-\lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65$$

$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = 15.65\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$$

– Or directly by

$$w^* = S_W^{-1}(\mu_1 - \mu_2) = [-0.91 \;\; -0.39]^T$$

– (A numerical check of these results is sketched after the figure below.)
[Figure: scatter plot of the two classes in the $(x_1, x_2)$ plane (axes from 0 to 10) with the resulting LDA direction $w_{LDA}$.]
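As a sanity check, the hand computation above can be reproduced numerically. The following NumPy sketch is illustrative (not from the lecture); it uses the same $1/N_i$ normalization of the class scatters as above and recovers the eigenvalue $\lambda \approx 15.65$ and the direction $(\pm 0.91, \pm 0.39)$.

```python
import numpy as np

X1 = np.array([(4, 1), (2, 4), (2, 3), (3, 6), (4, 4)], dtype=float)
X2 = np.array([(9, 10), (6, 8), (9, 5), (8, 7), (10, 8)], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)      # [3.0 3.6], [8.4 7.6]
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)         # class scatters divided by N_i
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
Sw = S1 + S2                                     # [[2.64 -0.44], [-0.44 5.28]]
Sb = np.outer(mu1 - mu2, mu1 - mu2)              # [[29.16 21.6], [21.6 16.0]]

# Generalized eigenvalue problem Sw^{-1} Sb v = lambda v
evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
top = np.argmax(evals)
print(evals[top])             # ~15.65
print(evecs[:, top])          # ~[0.91 0.39] (up to sign)

# Direct closed-form solution: same direction up to sign and scale
w = np.linalg.solve(Sw, mu1 - mu2)
print(w / np.linalg.norm(w))  # ~[-0.91 -0.39]
```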
LDA, C classes
• Fisher's LDA generalizes gracefully for C-class problems
– Instead of one projection $y$, we will now seek $(C-1)$ projections $[y_1, y_2, \dots, y_{C-1}]$ by means of $(C-1)$ projection vectors $w_i$, arranged by columns into a projection matrix $W = [w_1 | w_2 | \dots | w_{C-1}]$:

$$y_i = w_i^T x \;\Rightarrow\; y = W^T x$$

• Derivation
– The within-class scatter generalizes as

$$S_W = \sum_{i=1}^{C} S_i$$

• where $S_i = \sum_{x\in\omega_i}(x - \mu_i)(x - \mu_i)^T$ and $\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x$
– And the between-class scatter becomes

$$S_B = \sum_{i=1}^{C} N_i(\mu_i - \mu)(\mu_i - \mu)^T$$

• where $\mu = \frac{1}{N}\sum_{x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\mu_i$
– The matrix $S_T = S_B + S_W$ is called the total scatter (a numerical check of this identity is sketched after the figure below)
[Figure: a three-class example in the $(x_1, x_2)$ plane showing the per-class within-class scatters $S_{W1}, S_{W2}, S_{W3}$ and between-class scatters $S_{B1}, S_{B2}, S_{B3}$.]
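A quick way to internalize these definitions is to verify $S_T = S_B + S_W$ numerically. The snippet below is an illustrative check on randomly generated labeled data (my own example, not from the lecture).

```python
import numpy as np

rng = np.random.default_rng(0)
# Three classes of 2-D points with different means (illustrative data)
X = np.vstack([rng.normal(loc=m, size=(30, 2)) for m in ([0, 0], [4, 1], [2, 5])])
y = np.repeat([0, 1, 2], 30)

mu = X.mean(axis=0)
Sw = np.zeros((2, 2))
Sb = np.zeros((2, 2))
for c in np.unique(y):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)               # within-class scatter
    Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class scatter

St = (X - mu).T @ (X - mu)                          # total scatter
print(np.allclose(St, Sb + Sw))                     # True
```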
– Similarly, we define the mean vectors and scatter matrices for the projected samples as

$$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y \qquad\qquad \tilde{S}_W = \sum_{i=1}^{C}\sum_{y\in\omega_i}(y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T$$

$$\tilde{\mu} = \frac{1}{N}\sum_{y} y \qquad\qquad \tilde{S}_B = \sum_{i=1}^{C} N_i(\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$

– From our derivation for the two-class problem, we can write

$$\tilde{S}_W = W^T S_W W \qquad\qquad \tilde{S}_B = W^T S_B W$$

– Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has $C-1$ dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function

$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$

– And we will seek the projection matrix $W^*$ that maximizes this ratio
– It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem

$$W^* = [w_1^* | w_2^* | \dots | w_{C-1}^*] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \;\Rightarrow\; (S_B - \lambda_i S_W)\,w_i^* = 0$$

• NOTES
– $S_B$ is the sum of $C$ matrices of rank $\le 1$, and the mean vectors are constrained by $\mu = \frac{1}{N}\sum_{i=1}^{C} N_i\mu_i$ (so the vectors $\mu_i - \mu$ are linearly dependent)
• Therefore, $S_B$ will be of rank $(C-1)$ or less
• This means that only $(C-1)$ of the eigenvalues $\lambda_i$ will be non-zero
– The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$ (a minimal implementation is sketched below)
– LDA can be derived as the Maximum Likelihood method for the case of normal class-conditional densities with equal covariance matrices
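The multi-class procedure reduces to an eigendecomposition of $S_W^{-1} S_B$. The NumPy sketch below is illustrative rather than canonical; the helper name lda_projection and the pseudo-inverse fallback for a singular $S_W$ are my own choices.

```python
import numpy as np

def lda_projection(X, y):
    """Return the (D, C-1) LDA projection matrix for data X (N, D) and labels y (N,)."""
    classes = np.unique(y)
    D = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # Eigenvectors of Sw^{-1} Sb; at most C-1 eigenvalues are non-zero
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(evals.real)[::-1]
    W = evecs.real[:, order[:len(classes) - 1]]
    return W  # project with Y = X @ W
```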
LDA vs. PCA
• This example illustrates the performance of PCA and LDA on an odor recognition problem
– Five types of coffee beans were presented to an array of gas sensors
– For each coffee type, 45 "sniffs" were performed and the response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector
• Results
– From the 3D scatter plots it is clear that LDA outperforms PCA in terms of class discrimination
– This is one example where the discriminatory information is not aligned with the direction of maximum variance (a small reproducible comparison is sketched after the figures below)
[Figure: normalized sensor responses of the gas-sensor array for the five coffee types (Sulawesy, Kenya, Arabian, Sumatra, Colombia).]

[Figure: 3D scatter plots of the coffee data projected onto the first three PCA axes and onto the first three LDA axes; the classes overlap heavily in the PCA projection and are well separated in the LDA projection.]
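The coffee dataset itself is not reproduced here, but the PCA-versus-LDA comparison is easy to replicate on any labeled data. The sketch below uses scikit-learn on synthetic data (my own stand-in for the 60-dimensional, 5-class sensor data) and compares how well a simple classifier performs on the leading PCA components versus the LDA projections; the actual numbers depend on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the 60-dimensional, 5-class sensor data
X, y = make_classification(n_samples=225, n_features=60, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
pca_pipe = make_pipeline(PCA(n_components=4), knn)            # unsupervised projection
lda_pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=4), knn)  # C-1 = 4 axes

print("PCA + kNN accuracy:", cross_val_score(pca_pipe, X, y, cv=5).mean())
print("LDA + kNN accuracy:", cross_val_score(lda_pipe, X, y, cv=5).mean())
```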
Limitations of LDA
• LDA produces at most $C-1$ feature projections
– If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features
• LDA is a parametric method (it assumes unimodal Gaussian likelihoods)
– If the distributions are significantly non-Gaussian, the LDA projections may not preserve the complex structure in the data needed for classification
• LDA will also fail if the discriminatory information is not in the mean but in the variance of the data (see the figure and code sketch below)
[Figure: examples where LDA fails, shown with the PCA and LDA directions in the $(x_1, x_2)$ plane: classes with equal means ($\mu_1 = \mu_2$) whose discriminatory information lies in multimodal structure or in the variance rather than the mean.]
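To make the last limitation concrete, the following illustrative snippet (my own construction, not from the lecture) builds two classes with identical means but different variances: $S_B$ is essentially zero, so the Fisher criterion cannot single out any useful direction even though the classes are separable by their spread.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two classes with the same mean but different variances (illustrative data)
X1 = rng.normal(0.0, 0.5, size=(200, 2))   # tight class
X2 = rng.normal(0.0, 3.0, size=(200, 2))   # spread-out class

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
Sb = np.outer(mu1 - mu2, mu1 - mu2)

w = np.linalg.solve(Sw, mu1 - mu2)          # Fisher direction, two-class closed form
J = (w @ Sb @ w) / (w @ Sw @ w)             # maximum achievable Fisher criterion
print(np.linalg.norm(mu1 - mu2))            # small: the sample means nearly coincide
print(J)                                    # ~0: no direction separates the class means
```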
Variants of LDA
• Non-parametric LDA (Fukunaga)
– NPLDA relaxes the unimodal Gaussian assumption by computing $S_B$ using local information and the kNN rule. As a result of this
• The matrix $S_B$ is full-rank, allowing us to extract more than $(C-1)$ features
• The projections are able to preserve the structure of the data more closely
• Orthonormal LDA (Okada and Tomita)
– OLDA computes projections that maximize the Fisher criterion and, at the same time, are pairwise orthonormal
• The method used in OLDA combines the eigenvalue solution of $S_W^{-1} S_B$ and the Gram-Schmidt orthonormalization procedure
• OLDA sequentially finds axes that maximize the Fisher criterion in the subspace orthogonal to all features already extracted
• OLDA is also capable of finding more than $(C-1)$ features
• Generalized LDA (Lowe)
– GLDA generalizes the Fisher criterion by incorporating a cost function similar to the one we used to compute the Bayes risk
• As a result, LDA can produce projections that are biased by the cost function, i.e., classes with a higher cost $C_{ij}$ will be placed further apart in the low-dimensional projection
• Multilayer perceptrons (Webb and Lowe)
– It has been shown that the hidden layers of multilayer perceptrons perform non-linear discriminant analysis by maximizing $tr[S_B S_T^\dagger]$, where the scatter matrices are measured at the output of the last hidden layer
Other dimensionality reduction methods
• Exploratory Projection Pursuit (Friedman and Tukey)
– EPP seeks an M-dimensional (typically M = 2, 3) linear projection of the data that maximizes a measure of "interestingness"
– Interestingness is measured as departure from multivariate normality
• This measure is not the variance and is commonly scale-free. In most implementations it is also affine invariant, so it does not depend on correlations between features [Ripley, 1996]
– In other words, EPP seeks projections that separate clusters as much as possible while keeping those clusters compact, a criterion similar to Fisher's, but EPP does NOT use class labels
– Once an interesting projection is found, it is important to remove the structure it reveals to allow other interesting views to be found more easily
[Figure: an "interesting" and an "uninteresting" projection of the same data onto the $(x_1, x_2)$ plane.]
• Sammon's non-linear mapping (Sammon)
– This method seeks a mapping onto an M-dimensional space that preserves the inter-point distances in the original N-dimensional space
– This is accomplished by minimizing the following objective function (a gradient-descent sketch follows the figure below)

$$E(d, d') = \sum_{i \ne j} \frac{\left[d(P_i, P_j) - d(P_i', P_j')\right]^2}{d(P_i, P_j)}$$

• The original method did not obtain an explicit mapping but only a lookup table for the elements in the training set
– Newer implementations based on neural networks do provide an explicit mapping for test data and also consider cost functions (e.g., Neuroscale)
• Sammon's mapping is closely related to Multi-Dimensional Scaling (MDS), a family of multivariate statistical methods commonly used in the social sciences
– We will review MDS techniques when we cover manifold learning
[Figure: Sammon's mapping of points $P_i, P_j$ in the original space $(x_1, x_2, x_3)$ to points $P_i', P_j'$ in a 2D space $(x_1, x_2)$ such that $d(P_i, P_j) = d(P_i', P_j')\ \forall i, j$.]
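To make the objective concrete, here is a small gradient-descent sketch of Sammon's mapping written for this transcript; it is a simplified illustration (plain batch gradient descent on the stress above, PCA initialization), not the original algorithm's exact update rule or Neuroscale.

```python
import numpy as np

def sammon(X, dim=2, n_iter=500, lr=0.01):
    """Crude gradient-descent minimization of the Sammon stress (illustrative sketch)."""
    n = len(X)
    # Pairwise distances in the original space (guard against zeros)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D[D == 0] = 1e-12
    # Initialize the low-dimensional configuration with PCA (a common choice)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:dim].T
    for _ in range(n_iter):
        d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
        d[d == 0] = 1e-12
        # Descent direction for sum_{i != j} (D_ij - d_ij)^2 / D_ij
        # (constant factors are absorbed into the learning rate)
        coef = -(D - d) / (D * d)
        np.fill_diagonal(coef, 0.0)
        grad = (coef[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1) / n
        Y -= lr * grad
    # Report the final stress, excluding the i == j terms
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    mask = ~np.eye(n, dtype=bool)
    stress = np.sum((D[mask] - d[mask]) ** 2 / D[mask])
    return Y, stress

# Example: Y2, stress = sammon(np.random.default_rng(0).normal(size=(50, 5)))
```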