Discriminative Feature Extraction and Dimension Reduction
- PCA, LDA and LSA

References:
1. Introduction to Machine Learning, Chapter 6
2. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 3

Berlin Chen
Graduate Institute of Computer Science & Information Engineering
National Taiwan Normal University
ML-2
Introduction (1/3)
• Goal: discover significant patterns or features from the input data
  – Salient feature selection or dimensionality reduction
  – Compute an input-output mapping based on some desirable properties
[Figure: a network W maps x in the input space to y in the feature space]
ML-3
Introduction (2/3)
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Latent Semantic Analysis (LSA)
• Heteroscedastic Discriminant Analysis (HDA)
ML-4
Introduction (3/3)
• Formulation for discriminative feature extraction
  – Model-free (nonparametric)
    • Without prior information, e.g., PCA
    • With prior information, e.g., LDA
  – Model-dependent (parametric)
    • E.g., PLSA (Probabilistic Latent Semantic Analysis) with EM (Expectation-Maximization), MCE (Minimum Classification Error) training
ML-5
Principal Component Analysis (PCA) (1/2)
• Known as the Karhunen-Loève Transform (1947, 1963)
  – Or the Hotelling Transform (1933); first introduced by Pearson (1901)
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features while still retaining most of the intrinsic information content
  – A small set of features is found to represent the data samples accurately
• Also called "Subspace Decomposition" or "Factor Analysis"
ML-6
Principal Component Analysis (PCA) (2/2)
The patterns show a significant difference from each other in one of the transformed axes
ML-7
PCA Derivations (1/13)
• Suppose x is an n-dimensional zero-mean random vector, μ_x = E[x] = 0
  – If x is not zero-mean, we can subtract the mean before the following analysis
  – x can be represented without error by a summation of n linearly independent vectors:

      x = Σ_{i=1}^{n} y_i φ_i = Φ y,   where y = [y_1, y_2, ..., y_n]^T and Φ = [φ_1, φ_2, ..., φ_n]

    (y_i is the i-th component in the feature (mapped) space; the φ_i are the basis vectors)
ML-8
PCA Derivations (2/13)
• A two-dimensional example with two different orthogonal basis sets:

    [2; 3] = 2 [1; 0] + 3 [0; 1]              (coordinates (2, 3) in the standard basis)
           = (5/2) [1; 1] + (1/2) [-1; 1]     (coordinates (5/2', 1/2') in the rotated basis)
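A quick numerical check of the example above (a minimal MATLAB sketch; the basis vectors are taken directly from the slide, the variable names are illustrative):

    x  = [2; 3];
    B1 = [1 0; 0 1];          % standard basis (columns)
    B2 = [1 -1; 1  1];        % rotated orthogonal basis (columns), not normalized
    c1 = B1 \ x;              % -> [2; 3]
    c2 = B2 \ x;              % -> [2.5; 0.5], i.e. (5/2, 1/2)
    B2 * c2                   % reconstructs [2; 3] without error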
ML-9
PCA Derivations (3/13)
– Further assume the column (basis) vectors of the matrix Φ form an orthonormal set:

      φ_i^T φ_j = 1 if i = j,  0 if i ≠ j

  • Such that y_i is equal to the projection of x on φ_i:

      y_i = φ_i^T x = x^T φ_i,  ∀i
      e.g., y_1 = x^T φ_1 = ||x|| cos θ,  where ||φ_1|| = 1

[Figure: projection of x onto the orthonormal basis vectors φ_1 and φ_2, giving components y_1 and y_2]
ML-10
PCA Derivations (4/13)
– Further assume the column (basis) vectors of the matrix Φ form an orthonormal set
  • y_i then also has the following properties:
    – Its mean is zero, too:

        E[y_i] = E[φ_i^T x] = φ_i^T E[x] = 0

    – Its variance is

        σ_{y_i}² = E[y_i²] − (E[y_i])² = E[y_i²] = E[φ_i^T x x^T φ_i] = φ_i^T E[x x^T] φ_i = φ_i^T R φ_i

      where R = E[x x^T] ≈ (1/N) Σ_i x_i x_i^T is the (auto-)correlation matrix of x
      (since μ_x = 0, R coincides with the covariance matrix Σ = E[(x − μ_x)(x − μ_x)^T])

    – The correlation between two projections y_i and y_j is

        E[y_i y_j] = E[(φ_i^T x)(φ_j^T x)] = E[φ_i^T x x^T φ_j] = φ_i^T R φ_j
ML-11
PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  – We want to choose only m of the φ_i's such that we can still approximate x well under the mean-squared-error criterion:

      x = Σ_{i=1}^{n} y_i φ_i = Σ_{i=1}^{m} y_i φ_i + Σ_{j=m+1}^{n} y_j φ_j        (original vector)
      x̂(m) = Σ_{i=1}^{m} y_i φ_i                                                   (reconstructed vector)

  – The resulting mean-squared error is

      ε̄²(m) = E[ ||x − x̂(m)||² ]
             = E[ (Σ_{j=m+1}^{n} y_j φ_j)^T (Σ_{k=m+1}^{n} y_k φ_k) ]
             = Σ_{j=m+1}^{n} Σ_{k=m+1}^{n} E[y_j y_k] φ_j^T φ_k
             = Σ_{j=m+1}^{n} E[y_j²]                 (∵ φ_j^T φ_k = 1 if j = k, 0 if j ≠ k)
             = Σ_{j=m+1}^{n} σ_{y_j}²                (∵ E[y_j] = 0 ⇒ σ_{y_j}² = E[y_j²] − (E[y_j])² = E[y_j²])
             = Σ_{j=m+1}^{n} φ_j^T R φ_j

  – We should therefore discard the bases onto which the projections have lower variances
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  – If the orthonormal (basis) set {φ_i} is selected to be the eigenvectors of the correlation matrix R, associated with eigenvalues {λ_i},
    • they will have the property that R φ_j = λ_j φ_j
  – Such that the mean-squared error mentioned above becomes

      ε̄²(m) = Σ_{j=m+1}^{n} φ_j^T R φ_j = Σ_{j=m+1}^{n} λ_j φ_j^T φ_j = Σ_{j=m+1}^{n} λ_j

  – R is real and symmetric, therefore its eigenvectors form an orthonormal set
  – R is positive definite (x^T R x > 0) ⇒ all eigenvalues are positive
ML-13
PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  – If the retained eigenvectors are those associated with the m largest eigenvalues, the mean-squared error will be

      ε̄²_opt(m) = Σ_{j=m+1}^{n} λ_j     (the sum of the n − m smallest eigenvalues)

  – Any two projections y_i and y_j will be mutually uncorrelated:
      E[y_i y_j] = φ_i^T R φ_j = λ_j φ_i^T φ_j = 0  for i ≠ j
    • Good news for most statistical modeling approaches
      – Gaussians and diagonal covariance matrices
• A two-dimensional example of Principal Component Analysis (a MATLAB sketch of the full procedure follows below)
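To make the procedure concrete, here is a minimal MATLAB sketch of PCA by eigendecomposition, consistent with the derivation above (the data matrix X, the retained dimension m, and all variable names are illustrative assumptions):

    % X: N-by-n data matrix, one sample per row
    mu = mean(X, 1);
    Xc = X - mu;                          % zero-mean the data (see slide ML-7)
    R  = (Xc' * Xc) / size(Xc, 1);        % (auto-)correlation matrix of the centered data
    [Phi, Lambda] = eig(R);
    [lambda, idx] = sort(diag(Lambda), 'descend');
    Phi = Phi(:, idx);                    % bases ordered by decreasing eigenvalue
    m  = 2;                               % number of retained components (illustrative)
    Y  = Xc * Phi(:, 1:m);                % projections y_i = phi_i' * x
    Xhat = Y * Phi(:, 1:m)' + mu;         % reconstruction; expected MSE = sum(lambda(m+1:end))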
ML-15
PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that the eigenvector choice is the optimal solution under the mean-squared-error criterion ε̄²(m)
  – Define the objective function (the error term to be minimized, plus Lagrange terms for the orthonormality constraints):

      J = Σ_{j=m+1}^{n} φ_j^T R φ_j − Σ_{j=m+1}^{n} Σ_{k=m+1}^{n} μ_jk (φ_j^T φ_k − δ_jk),
      where δ_jk = 1 if j = k, 0 if j ≠ k

  – Take the derivative and set it to zero (using ∂(φ^T R φ)/∂φ = 2 R φ):

      ∂J/∂φ_j = 2 R φ_j − 2 Σ_{k=m+1}^{n} μ_jk φ_k = 0,  ∀ j, m+1 ≤ j ≤ n
      ⇒ R Φ_{n−m} = Φ_{n−m} U_{(n−m)×(n−m)},
        where Φ_{n−m} = [φ_{m+1} ... φ_n] and U_{(n−m)×(n−m)} collects the multipliers μ_jk

  – A particular solution is obtained when U_{(n−m)×(n−m)} is a diagonal matrix whose diagonal elements are eigenvalues λ_{m+1}, ..., λ_n of R, and φ_{m+1}, ..., φ_n are their corresponding eigenvectors
ML-16
PCA Derivations (10/13)
• Given an input vector x with dimension n
  – Try to construct a linear transform Φ′ (Φ′ is an n×m matrix, m < n) such that the truncation result y = Φ′^T x is optimal in the mean-squared-error criterion

    Encoder:  x = [x_1, x_2, ..., x_n]^T   →   y = Φ′^T x = [y_1, y_2, ..., y_m]^T
    Decoder:  y = [y_1, y_2, ..., y_m]^T   →   x̂ = Φ′ y = [x̂_1, x̂_2, ..., x̂_n]^T

    minimize  E_x[ (x − x̂)^T (x − x̂) ],   where Φ′ = [e_1, e_2, ..., e_m] and Φ′^T Φ′ = I
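A minimal MATLAB sketch of this encoder/decoder view, reusing the eigenvector matrix Phi and the retained dimension m from the earlier sketch (names are illustrative):

    PhiP = Phi(:, 1:m);          % n-by-m truncated transform (the "encoder" basis)
    y    = PhiP' * x;            % encoder: m-dimensional code
    xhat = PhiP * y;             % decoder: n-dimensional reconstruction
    err  = sum((x - xhat).^2);   % squared reconstruction error for this sample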
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  – PCA needs no prior information (e.g., class distributions or output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
  – Select m such that

      (λ_1 + λ_2 + ... + λ_m) / (λ_1 + λ_2 + ... + λ_m + ... + λ_n) ≥ Threshold

  – Or select those eigenvectors with eigenvalues larger than the average input variance (the average eigenvalue):

      λ_m ≥ (1/n) Σ_{i=1}^{n} λ_i
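Both selection rules can be written in a few lines of MATLAB (a sketch; lambda is the descending eigenvalue vector from the earlier PCA sketch, and the 0.95 threshold is illustrative):

    ratio = cumsum(lambda) / sum(lambda);      % cumulative variance retained
    m1 = find(ratio >= 0.95, 1);               % rule 1: keep 95% of the total variance
    m2 = sum(lambda > mean(lambda));           % rule 2: keep eigenvalues above the average eigenvalue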
ML-19
PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of the average between-class variation and the average within-class variation is maximal:

    J(W) = S̃ = S̃_w + S̃_b = W^T S_w W + W^T S_b W,
    where S̃_w = W^T S_w W and S̃_b = W^T S_b W

    S_w = (1/N) Σ_j N_j Σ_j                           (average within-class variation)
    S_b = (1/N) Σ_j N_j (x̄_j − x̄)(x̄_j − x̄)^T          (average between-class variation)
    S   = (1/N) Σ_i (x_i − x̄)(x_i − x̄)^T              (total variation; i: sample index, j: class index)

• Try to show that S = S_w + S_b (see the derivation sketch below)
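A short derivation of the decomposition S = S_w + S_b asked for above, written out in LaTeX for completeness (standard scatter-matrix algebra):

    \begin{aligned}
    N\,S &= \sum_{i}(x_i-\bar{x})(x_i-\bar{x})^T
          = \sum_{j}\sum_{g(x_i)=j}\big[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\big]\big[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\big]^T \\
         &= \sum_{j}\sum_{g(x_i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^T
          + \sum_{j}N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T
          = N\,S_w + N\,S_b
    \end{aligned}

The cross terms vanish because Σ_{g(x_i)=j} (x_i − x̄_j) = 0 within each class.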
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions, the new feature dimensions, and the threshold for the information content retained]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256×256 image divided into 8×8 blocks for PCA coding]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure: reconstructed images under feature reduction and value reduction]
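A minimal MATLAB sketch of block-based PCA image coding in the spirit of this example (the 256×256 image, 8×8 blocks, and the number of retained components are taken from the slide; the image variable and the use of im2col/col2im from the Image Processing Toolbox are assumptions):

    % img: 256-by-256 grayscale image (double)
    blk = 8;                                           % 8x8 blocks
    X   = im2col(img, [blk blk], 'distinct')';         % each row is a vectorized 64-dim block
    mu  = mean(X, 1);  Xc = X - mu;
    [Phi, L] = eig(cov(Xc));
    [~, idx] = sort(diag(L), 'descend');  Phi = Phi(:, idx);
    m   = 8;                                           % keep 8 of 64 components (illustrative)
    Y   = Xc * Phi(:, 1:m);                            % coded blocks (feature reduction)
    Xr  = Y * Phi(:, 1:m)' + mu;                       % reconstructed blocks
    imgR = col2im(Xr', [blk blk], size(img), 'distinct');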
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps:
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel:

        x_1 = [x_11, x_21, ..., x_n1]^T,  x_2 = [x_12, x_22, ..., x_n2]^T,  ...,  x_L = [x_1L, x_2L, ..., x_nL]^T

    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis: find the eigenvectors of the matrix, i.e., the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0"; the other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps:
    • The faces are then represented as eigenface 0 (the average face) plus a linear combination of the remaining K (K ≤ L) eigenfaces:

        x̂_i = x̄ + w_i1 e_1 + w_i2 e_2 + ... + w_iK e_K
        ⇒ y_i = [w_i1, w_i2, ..., w_iK]^T     (feature vector of a person i)

  – The eigenface approach follows the minimum mean-squared-error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
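A compact MATLAB sketch of the steps above (X is assumed to hold one vectorized reference face per column; K and all names are illustrative):

    % X: n-by-L matrix, one vectorized reference face per column
    xbar = mean(X, 2);                    % "eigenface 0": the average face
    Xc   = X - xbar;
    [E, D] = eig(Xc * Xc');               % eigenfaces (for large images one would work with Xc'*Xc instead)
    [~, idx] = sort(diag(D), 'descend');
    K  = 7;                               % number of eigenfaces kept (illustrative)
    E  = E(:, idx(1:K));                  % n-by-K eigenface matrix [e_1 ... e_K]
    w  = E' * (X(:,1) - xbar);            % feature vector of person 1
    x1 = xbar + E * w;                    % reconstructed face for person 1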
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps:
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector) of dimension D = (M·n)×1
      – SD HMM model mean parameters (μ)
    • Apply Principal Component Analysis to the R supervectors to construct the eigenvoice space
    • Each new speaker S is represented by a point P in K-space:

        P_i = e(0) + w_i(1) e(1) + w_i(2) e(2) + ... + w_i(K) e(K),   where e(0) is built from the SI HMM model

[Figure: data from speakers 1..R is used to train speaker-dependent HMMs (starting from the SI HMM); their supervectors feed PCA for eigenvoice-space construction]
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1): correlates with pitch or sex
  – Dimension 2 (eigenvoice 2): correlates with amplitude
  – Dimension 3 (eigenvoice 3): correlates with second-formant movement
• Note: eigenface operates in feature space, while eigenvoice operates in model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called:
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors {x_i} with dimensionality n, each of them belonging to one of the J classes: g(x_i) = j, j ∈ {1, 2, ..., J}, where g(·) is the class index
  – The sample mean is:                 x̄ = (1/N) Σ_{i=1}^{N} x_i
  – The class sample means are:         x̄_j = (1/N_j) Σ_{g(x_i)=j} x_i
  – The class sample covariances are:   Σ_j = (1/N_j) Σ_{g(x_i)=j} (x_i − x̄_j)(x_i − x̄_j)^T
  – The average within-class variation before the transform:   S_w = (1/N) Σ_j N_j Σ_j
  – The average between-class variation before the transform:  S_b = (1/N) Σ_j N_j (x̄_j − x̄)(x̄_j − x̄)^T
ML-34
LDA Derivations (2/4)
• If the transform W = [w_1, w_2, ..., w_m] is applied:
  – The sample vectors will be:      y_i = W^T x_i
  – The sample mean will be:         ȳ = (1/N) Σ_{i=1}^{N} y_i = W^T ((1/N) Σ_{i=1}^{N} x_i) = W^T x̄
  – The class sample means will be:  ȳ_j = (1/N_j) Σ_{g(x_i)=j} W^T x_i = W^T x̄_j
  – The average within-class variation will be:

      S̃_w = (1/N) Σ_j N_j · (1/N_j) Σ_{g(x_i)=j} (W^T x_i − W^T x̄_j)(W^T x_i − W^T x̄_j)^T
          = W^T { (1/N) Σ_j N_j Σ_j } W
          = W^T S_w W
ML-35
LDA Derivations (3/4)
• If the transform W = [w_1, w_2, ..., w_m] is applied:
  – Similarly, the average between-class variation will be:  S̃_b = W^T S_b W
  – Try to find the optimal W such that the following objective function is maximized:

      J(W) = S̃_b / S̃_w = (W^T S_b W) / (W^T S_w W)

  • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in

      S_b w_i = λ_i S_w w_i

  • That is, the w_i's are the eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_b:

      S_w^{-1} S_b w_i = λ_i w_i

  (see the MATLAB sketch below)
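A minimal MATLAB sketch of the closed-form LDA solution above (X holds one sample per row and labels holds the class index of each sample; all names are illustrative):

    % X: N-by-n data matrix; labels: N-by-1 class indices in 1..J
    [N, n] = size(X);  classes = unique(labels);  J = numel(classes);
    xbar = mean(X, 1);  Sw = zeros(n);  Sb = zeros(n);
    for j = 1:J
        Xj  = X(labels == classes(j), :);  Nj = size(Xj, 1);
        muj = mean(Xj, 1);
        Sw  = Sw + (Nj/N) * cov(Xj, 1);                      % average within-class variation
        Sb  = Sb + (Nj/N) * (muj - xbar)' * (muj - xbar);    % average between-class variation
    end
    [V, D] = eig(Sw \ Sb);                                   % eigenvectors of inv(Sw)*Sb
    [~, idx] = sort(real(diag(D)), 'descend');
    m = J - 1;                                               % at most J-1 useful LDA directions
    W = real(V(:, idx(1:m)));                                % LDA transform
    Y = X * W;                                               % transformed samples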
ML-36
LDA Derivations (4/4)
• Proof:

    Ŵ = argmax_W J(W) = argmax_W (S̃_b / S̃_w) = argmax_W (W^T S_b W) / (W^T S_w W)
    (for the matrix form, S̃_b and S̃_w are compared via their determinants)

  – Or, for each column vector w_i of W, we want to find the w_i that maximizes

      J(w_i) = (w_i^T S_b w_i) / (w_i^T S_w w_i)

    This quadratic quotient form has an optimal solution where the derivative vanishes
    (using the quotient rule (F/G)′ = (F′G − F G′)/G² and d(x^T C x)/dx = (C + C^T) x):

      ∂J/∂w_i = [2 S_b w_i (w_i^T S_w w_i) − 2 S_w w_i (w_i^T S_b w_i)] / (w_i^T S_w w_i)² = 0
      ⇒ S_b w_i (w_i^T S_w w_i) − S_w w_i (w_i^T S_b w_i) = 0
      ⇒ S_b w_i − λ_i S_w w_i = 0,   where λ_i = (w_i^T S_b w_i) / (w_i^T S_w w_i)
      ⇒ S_b w_i = λ_i S_w w_i
      ⇒ S_w^{-1} S_b w_i = λ_i w_i
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: experiments on speech signal processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, Σ = (1/N) Σ_i (x_i − x̄)(x_i − x̄)^T, and covariance matrix of the 18 cepstral vectors (after the cosine transform), Σ′ = (1/N) Σ_i (y_i − ȳ)(y_i − ȳ)^T; both calculated using Year-99's 5,471 files]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: experiments on speech signal processing (cont.)

  Character Error Rate (%):
              WG       TC
    MFCC      22.71    26.32
    LDA-1     20.17    23.12
    LDA-2     20.11    23.11

[Figure: covariance matrix of the 18 PCA-cepstral vectors (after the PCA transform) and covariance matrix of the 18 LDA-cepstral vectors (after the LDA transform); both calculated using Year-99's 5,471 files]
ML-39
PCA vs. LDA (1/2)
[Figure: projection directions chosen by PCA versus LDA on the same data]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for a 2-class case
  – Clearly, HDA provides a much lower classification error than LDA, theoretically
• However, most statistical modeling assumes data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the two transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
• Eigen Decomposition

    % plot a covariance matrix as a surface
    CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
    colormap('default'); surf(CoVar);
    % between-class and within-class scatter matrices (example values)
    BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
    WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];
    % LDA:
    IWI = inv(WI); A = IWI * BE;
    % PCA:
    A = BE + WI;             % why? (Prove it!)
    [V, D] = eig(A);         % or: [V, D] = eigs(A, 3);
    % write the basis vectors to a file
    fid = fopen('Basis', 'w');
    for i = 1:3              % feature vector length
        for j = 1:3          % basis number
            fprintf(fid, '%10.10f ', V(i, j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: scatter plots (feature 1 vs. feature 2) of 2,000 original samples after the PCA transform and after the LDA transform]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: project onto the leading orthonormal basis vectors φ_1, ..., φ_k

      y_i = φ_i^T X,   X̂ = Σ_{i=1}^{k} y_i φ_i,   min ||X − X̂||² for a given k      (feature space)

  – SVD (in LSA): approximate the m×n matrix A by a rank-k matrix A′

      A (m×n) ≈ A′ = U′ (m×k) Σ′ (k×k) V′^T (k×n),   min ||A − A′||_F² for a given k,   k ≤ r ≤ min(m, n)
      (latent semantic space)
ML-47
LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
  • A least-squares method for dimension reduction

    Projection of a vector x:  y_1 = φ_1^T x = x^T φ_1 = ||x|| cos θ,  where ||φ_1|| = 1

[Figure: projection of x onto the basis vectors φ_1 and φ_2, giving components y_1 and y_2]
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a query and a doc can be matched by literal term matching over terms, by query expansion or doc expansion, or through structure models, i.e., latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
• Query: "human computer interaction"
[Figure: retrieval example; one of the query terms is an OOV word]
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)

    A (m×n word-document matrix; rows w_1..w_m, columns d_1..d_n) = U (m×r) Σ_r (r×r) V^T (r×n),   r ≤ min(m, n)

    A′ (m×n) = U′ (m×k) Σ_k (k×k) V′^T (k×n),   k ≤ r,   ||A||_F² ≥ ||A′||_F²

  – Both U and V have orthonormal column vectors: U^T U = I_{r×r}, V^T V = I_{r×r}
  – ||A||_F² = Σ_{i=1}^{m} Σ_{j=1}^{n} a_ij²;   Row A ∈ R^n, Col A ∈ R^m
  – Docs and queries are represented in a k-dimensional space; the quantities of the axes can be properly weighted according to the associated diagonal values of Σ_k
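A minimal MATLAB sketch of the truncated SVD above (A is assumed to be an m-by-n word-document count/weight matrix; the value of k is illustrative):

    % A: m-by-n word-document matrix
    k = 100;                          % number of latent dimensions (illustrative)
    [U, S, V] = svds(A, k);           % rank-k truncated SVD: A ~ U*S*V'
    wordVecs = U * S;                 % k-dim representations of the m words
    docVecs  = V * S;                 % k-dim representations of the n documents
    Aprime   = U * S * V';            % the rank-k approximation A'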
ML-52
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – A^T A is a symmetric n×n matrix
    • All eigenvalues λ_j are nonnegative real numbers:  λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ 0
    • All eigenvectors v_j are orthonormal (∈ R^n):  V_{n×n} = [v_1 v_2 ... v_n],  v_j^T v_j = 1,  V^T V = I_{n×n}
  – Define the singular values σ_j = √λ_j, j = 1, ..., n  (so Σ² = diag(λ_1, ..., λ_n))
    • As the square roots of the eigenvalues of A^T A
    • As the lengths of the vectors A v_1, A v_2, ..., A v_n:  σ_1 = ||A v_1||, σ_2 = ||A v_2||, ...

        ||A v_i||² = v_i^T A^T A v_i = λ_i v_i^T v_i = λ_i   ⇒   ||A v_i|| = σ_i

  – For λ_i ≠ 0, i = 1, ..., r:  {A v_1, A v_2, ..., A v_r} is an orthogonal basis of Col A
ML-53
LSA Derivations (2/7)
• {A v_1, A v_2, ..., A v_r} is an orthogonal basis of Col A:

    (A v_i)·(A v_j) = v_i^T A^T A v_j = λ_j v_i^T v_j = 0,  for i ≠ j

  – Suppose that A (or A^T A) has rank r ≤ n:  λ_1 ≥ λ_2 ≥ ... ≥ λ_r > 0,  λ_{r+1} = λ_{r+2} = ... = λ_n = 0
  – Define an orthonormal basis {u_1, u_2, ..., u_r} for Col A:

      u_i = (1/σ_i) A v_i  ⇒  A v_i = σ_i u_i  ⇒  [u_1 u_2 ... u_r] Σ = A [v_1 v_2 ... v_r]

    (V is an orthonormal matrix (n×r), known in advance; U is also an orthonormal matrix (m×r))

    • Extend to an orthonormal basis {u_1, u_2, ..., u_m} of R^m:

        [u_1 u_2 ... u_m] Σ = A [v_1 v_2 ... v_n]  ⇒  U Σ = A V  ⇒  U Σ V^T = A V V^T = A  ⇒  A = U Σ V^T

      where the extended Σ (m×n) has the block form [Σ_{r×r}  0_{r×(n−r)}; 0_{(m−r)×r}  0_{(m−r)×(n−r)}]

  – ||A||_F² = Σ_i Σ_j a_ij² = σ_1² + σ_2² + ... + σ_r²
ML-54
LSA Derivations (3/7)

    A = U Σ V^T = [U_1 U_2] [Σ_1 0; 0 0] [V_1 V_2]^T = U_1 Σ_1 V_1^T,
    so  A V_1 = U_1 Σ_1  and  A V_2 = 0

  – The v_i^T (columns of V_1) span the row space of A; the u_i (columns of U_1) span the column space of A

[Figure: the SVD viewed as a mapping from R^n to R^m — V^T rotates, Σ scales, U rotates; U Σ = A V]
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of the corresponding row of A onto the basis formed by the columns of V:

      A = U Σ V^T  ⇒  A V = U Σ V^T V = U Σ  ⇒  U Σ = A V

    • The i-th entry of a row of U is related to the projection of the corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of the corresponding row of A^T onto the basis formed by the columns of U:

      A = U Σ V^T  ⇒  A^T U = V Σ^T U^T U = V Σ  ⇒  V Σ = A^T U

    • The i-th entry of a row of V is related to the projection of the corresponding row of A^T onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), with rows w_1..w_m and columns d_1..d_n:
    • compare two terms → dot product of two rows of A (or an entry in A A^T)
    • compare two docs → dot product of two columns of A (or an entry in A^T A)
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A′):
    • compare two terms → dot product of two rows of U′Σ′
    • compare two docs → dot product of two rows of V′Σ′
    • compare a query word and a doc → each individual entry of A′
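A minimal MATLAB sketch of these comparisons in the reduced space, continuing from the truncated SVD sketch earlier (U, S, V are the rank-k factors; the indices used are illustrative):

    % U (m-by-k), S (k-by-k), V (n-by-k) from svds(A, k)
    termVecs = U * S;                            % rows of U'*Sigma': one k-dim vector per term
    docVecs  = V * S;                            % rows of V'*Sigma': one k-dim vector per doc
    termSim  = termVecs(1,:) * termVecs(2,:)';   % compare term 1 and term 2
    docSim   = docVecs(1,:)  * docVecs(2,:)';    % compare doc 1 and doc 2
    Aprime   = U * S * V';
    termDoc  = Aprime(1, 2);                     % compare term 1 and doc 2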
bull Goal discover significant patterns or features from the input datandash Salient feature selection or dimensionality reduction
ndash Compute an input-output mapping based on some desirable properties
Networkx yInput space Feature space
W
ML-3
Introduction (23)
bull Principal Component Analysis (PCA)
bull Linear Discriminant Analysis (LDA)
bull Latent Semantic Analysis (LSA)
bull Heteroscedastic Discriminant Analysis (HDA)
ML-4
Introduction (33)
bull Formulation for discriminative feature extraction ndash Model-free (nonparametric)
bull Without prior information eg PCA bull With prior information eg LDA
ndash Model-dependent (parametric)bull Eg PLSA (Probabilistic Latent Semantic Analysis) with EM
(Expectation-Maximization) MCE (Minimum Classification Error) Training
ML-5
Principal Component Analysis (PCA) (12)
bull Known as Karhunen-Loẻve Transform (1947 1963)
ndash Or Hotelling Transform (1933)
bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples
accurately
bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo
Pearson 1901
ML-6
Principal Component Analysis (PCA) (22)
The patterns show a significant differencefrom each other in one of the transformed axes
ML-7
PCA Derivations (113)
bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean
before processing the following analysis
ndash x can be represented without error by the summation of n linearly independent vectors
sum ===
n
iiiiy Φyφx [ ]Tni yyy where 1=y
[ ]ni φφφΦ 1=
0xμ x == E
The basis vectorsThe i-th component
in the feature (mapped) space
ML-8
PCA Derivations (213)
(23)
(01)
(10)
(11)(-11)
(52rsquo12rsquo)
⎥⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡minus+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡11
121
11 1
21
11
25
10
301
232
orthogonal basis sets
ML-9
PCA Derivations (313)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull Such that is equal to the projectionof on
⎩⎨⎧
ne=
=jiji
jTi if 0
if 1φφ
iφxxx T
iiT
ii y ϕϕ ==forall
iy
x1ϕ
2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Φ
ML-10
PCA Derivations (413)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull also has the following propertiesndash Its mean is zero too
ndash Its variance is
bull The correlation between two projections and is
Φiy
0==== 0xx Ti
Ti
Tii EEyE ϕϕϕ jy
( )( ) j
Tij
TTi
jTT
iTT
jTiji
E
EEyyE
Rφφφxxφ
φxxφxφxφ
==
==
[ ] [ ]xRR
xxxx
ofmatrix relation (auto-)cor theis
2222
iTi
iTT
iiTT
i EEyEyEyEii
i
ϕϕ
ϕϕϕϕσ
=
===minus=
( )( )
sum==
minus⎟⎠⎞
⎜⎝⎛
sumasymp
minusminus=
=
i
Tii
T
TN
i
Tii
T
NE
N
E
xxxxR
μμxx
μxμxΣ
1
1
][
1 0
iy jy
ML-11
PCA Derivations (513)
bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can
approximate well in mean-squared error criterionxsiφ
sumsumsum+===
+==n
mjjj
m
ii
n
ii yyy
111φφφx ii
( ) sum=
=m
iiym
1ˆ iφx
( ) ( )
sumsum
sum
sum sum
sumsum
+=+=
+=
+= +=
+=+=
==
=
⎭⎬⎫
⎩⎨⎧=
⎭⎬⎫
⎩⎨⎧
⎟⎠⎞⎜
⎝⎛⎟⎠⎞
⎜⎝⎛=minus=
n
mjj
Tj
n
mjj
n
mjj
kTj
n
mj
n
mkkj
kn
mkk
Tj
n
mjj
yE
yyE
yyEmEm
11
2
1
2
1 1
11
2
ˆ
R φφ
φφ
φφxx
σ
ε
⎩⎨⎧
ne=
=kjkj
kTj if 0
if 1φφQ
[ ] 2
222
0
j
j
yE
yEyE
yE
jj
j
=
minus=
=
σ
We should discard the
bases where the projections have lower variances
original vector
reconstructed vector
ML-12
PCA Derivations (613)
bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the
eigenvectors of the correlation matrix associated with eigenvalues
bull They will have the property that
ndash Such that the mean-squared error mentioned above will be
siφR
siλ
jjj φR φ λ=
( )
sumsumsum
sum
+=+=+=
+=
===
=
n
mjj
n
mjjj
Tj
n
mjj
Tj
n
mjjm
111
1
2
λλ
σε
φφR φφ
is real and symmetric therefore its eigenvectors
form a orthonormal setR
is positive definite ( )=gt all eigenvalues are positive
R 0gtRxxT
ML-13
PCA Derivations (713)
bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest
eigenvalues the mean-squared error will be
ndash Any two projections and will be mutually uncorrelated
bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Formulation for discriminative feature extraction ndash Model-free (nonparametric)
bull Without prior information eg PCA bull With prior information eg LDA
ndash Model-dependent (parametric)bull Eg PLSA (Probabilistic Latent Semantic Analysis) with EM
(Expectation-Maximization) MCE (Minimum Classification Error) Training
ML-5
Principal Component Analysis (PCA) (12)
bull Known as Karhunen-Loẻve Transform (1947 1963)
ndash Or Hotelling Transform (1933)
bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples
accurately
bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo
Pearson 1901
ML-6
Principal Component Analysis (PCA) (22)
The patterns show a significant differencefrom each other in one of the transformed axes
ML-7
PCA Derivations (113)
bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean
before processing the following analysis
ndash x can be represented without error by the summation of n linearly independent vectors
sum ===
n
iiiiy Φyφx [ ]Tni yyy where 1=y
[ ]ni φφφΦ 1=
0xμ x == E
The basis vectorsThe i-th component
in the feature (mapped) space
ML-8
PCA Derivations (213)
(23)
(01)
(10)
(11)(-11)
(52rsquo12rsquo)
⎥⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡minus+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡11
121
11 1
21
11
25
10
301
232
orthogonal basis sets
ML-9
PCA Derivations (313)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull Such that is equal to the projectionof on
⎩⎨⎧
ne=
=jiji
jTi if 0
if 1φφ
iφxxx T
iiT
ii y ϕϕ ==forall
iy
x1ϕ
2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Φ
ML-10
PCA Derivations (413)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull also has the following propertiesndash Its mean is zero too
ndash Its variance is
bull The correlation between two projections and is
Φiy
0==== 0xx Ti
Ti
Tii EEyE ϕϕϕ jy
( )( ) j
Tij
TTi
jTT
iTT
jTiji
E
EEyyE
Rφφφxxφ
φxxφxφxφ
==
==
[ ] [ ]xRR
xxxx
ofmatrix relation (auto-)cor theis
2222
iTi
iTT
iiTT
i EEyEyEyEii
i
ϕϕ
ϕϕϕϕσ
=
===minus=
( )( )
sum==
minus⎟⎠⎞
⎜⎝⎛
sumasymp
minusminus=
=
i
Tii
T
TN
i
Tii
T
NE
N
E
xxxxR
μμxx
μxμxΣ
1
1
][
1 0
iy jy
ML-11
PCA Derivations (513)
bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can
approximate well in mean-squared error criterionxsiφ
sumsumsum+===
+==n
mjjj
m
ii
n
ii yyy
111φφφx ii
( ) sum=
=m
iiym
1ˆ iφx
( ) ( )
sumsum
sum
sum sum
sumsum
+=+=
+=
+= +=
+=+=
==
=
⎭⎬⎫
⎩⎨⎧=
⎭⎬⎫
⎩⎨⎧
⎟⎠⎞⎜
⎝⎛⎟⎠⎞
⎜⎝⎛=minus=
n
mjj
Tj
n
mjj
n
mjj
kTj
n
mj
n
mkkj
kn
mkk
Tj
n
mjj
yE
yyE
yyEmEm
11
2
1
2
1 1
11
2
ˆ
R φφ
φφ
φφxx
σ
ε
⎩⎨⎧
ne=
=kjkj
kTj if 0
if 1φφQ
[ ] 2
222
0
j
j
yE
yEyE
yE
jj
j
=
minus=
=
σ
We should discard the
bases where the projections have lower variances
original vector
reconstructed vector
ML-12
PCA Derivations (613)
bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the
eigenvectors of the correlation matrix associated with eigenvalues
bull They will have the property that
ndash Such that the mean-squared error mentioned above will be
siφR
siλ
jjj φR φ λ=
( )
sumsumsum
sum
+=+=+=
+=
===
=
n
mjj
n
mjjj
Tj
n
mjj
Tj
n
mjjm
111
1
2
λλ
σε
φφR φφ
is real and symmetric therefore its eigenvectors
form a orthonormal setR
is positive definite ( )=gt all eigenvalues are positive
R 0gtRxxT
ML-13
PCA Derivations (713)
bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest
eigenvalues the mean-squared error will be
ndash Any two projections and will be mutually uncorrelated
bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: with an orthonormal basis $\{\varphi_1, \dots, \varphi_k\}$ of the feature space,
    $y_i = \varphi_i^T X$ and $\hat{X} = \sum_{i=1}^{k} y_i \varphi_i$; the basis is chosen to minimize $\| X - \hat{X} \|^2$ for a given $k$
  – SVD (in LSA): $A_{m \times n} = U'_{m \times r}\, \Sigma'_{r \times r}\, V'^T_{r \times n}$, with $r \le \min(m, n)$;
    keeping only the leading $k \times k$ block of $\Sigma'$ (the latent semantic space of dimension $k$) gives $A'_{m \times n}$,
    which minimizes $\| A - A' \|_F^2$ for a given $k$
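A minimal MATLAB/Octave sketch of the rank-k (least-squares) approximation described above (the matrix A and the target dimension k are assumed to be given):

[U, S, V] = svd(A, 'econ');              % A = U*S*V'
Uk = U(:, 1:k); Sk = S(1:k, 1:k); Vk = V(:, 1:k);
Ak = Uk * Sk * Vk';                      % best rank-k approximation of A in the Frobenius norm
approx_err = norm(A - Ak, 'fro');        % ||A - A'||_F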
ML-47
LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector $x$ onto a unit-length basis vector $\varphi_1$:
$$y_1 = \varphi_1^T x = \left(x^T \varphi_1\right)^T = \|x\| \cos\theta_1, \quad \text{where } \|\varphi_1\| = 1$$
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Diagram: a query and a doc are each represented by their terms; literal term matching compares the two sets of terms directly, while query expansion and doc expansion map each side into a latent semantic structure model, enabling latent semantic structure retrieval.]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD) of the word-document matrix (rows $w_1, \dots, w_m$ are words; columns $d_1, \dots, d_n$ are docs):
  – Exact (rank-$r$) decomposition: $A_{m \times n} = U_{m \times r}\, \Sigma_r\, V^T_{r \times n}$, with $r \le \min(m, n)$
  – Truncated (rank-$k$) decomposition: $A'_{m \times n} = U'_{m \times k}\, \Sigma_k\, V'^T_{k \times n}$, with $k \le r$
• Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
• $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
• Rows of $A$ lie in $R^n$; columns of $A$ lie in $R^m$
• Docs and queries are represented in a $k$-dimensional space; the quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
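To make the word-document matrix concrete, a toy construction in MATLAB/Octave (the vocabulary and the three "documents" below are invented for illustration only):

vocab = {'human', 'computer', 'interaction', 'system', 'user'};       % m = 5 words (illustrative)
docs  = { {'human', 'computer', 'interaction'}, ...
          {'computer', 'system', 'user'}, ...
          {'user', 'interaction', 'system', 'system'} };              % n = 3 toy docs

A = zeros(numel(vocab), numel(docs));          % m x n word-document matrix of term counts
for j = 1:numel(docs)
    for w = 1:numel(vocab)
        A(w, j) = sum(strcmp(docs{j}, vocab{w}));
    end
end
[U, S, V] = svd(A);                            % A = U*S*V', ready for truncation to k dimensions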
ML-52
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$, and $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1\, v_2 \cdots v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
  – Define the singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$
    • As the square roots of the eigenvalues of $A^T A$
    • As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
  – For $\lambda_i \ne 0$, $i = 1, \dots, r$: $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$, since
    $\|Av_i\|^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = \lambda_i\, v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$
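A small numerical check of the relation between singular values and the eigenvalues of $A^T A$, as a MATLAB/Octave sketch (the matrix is an arbitrary example, not taken from the slides):

A = [3 1 1; -1 3 1];                           % any m x n example matrix
lambda = sort(eig(A' * A), 'descend');         % eigenvalues of A'A, sorted in descending order
sigma  = svd(A);                               % singular values of A
disp([sqrt(lambda(1:numel(sigma))) sigma]);    % the two columns agree: sigma_j = sqrt(lambda_j)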
ML-53
LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$:
  $Av_i \cdot Av_j = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j\, v_i^T v_j = 0$ for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$ (known in advance):
    $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col $A$:
    $u_i = \frac{Av_i}{\|Av_i\|} = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; Av_i = \sigma_i u_i \;\Rightarrow\; [u_1\, u_2 \cdots u_r]\, \Sigma_1 = A\, [v_1\, v_2 \cdots v_r]$
    ($V$ is an orthonormal $n \times r$ matrix here; $U_1 = [u_1 \cdots u_r]$ is also an orthonormal $m \times r$ matrix)
• Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$:
  $[u_1 \cdots u_r \cdots u_m]\, \Sigma = A\, [v_1 \cdots v_r \cdots v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = A V V^T = A\, I_{n \times n} \;\Rightarrow\; A = U\Sigma V^T$,
  where $\Sigma = \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}$, with $\Sigma_1$ of size $r \times r$ and zero blocks of sizes $r \times (n-r)$, $(m-r) \times r$, and $(m-r) \times (n-r)$
• $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$
ML-54
LSA Derivations (3/7)
[Block-matrix view of the SVD: $V = [V_1\; V_2]$ and $U = [U_1\; U_2]$, where the columns of $V_1$ span the row space of $A$ (in $R^n$), $A V_2 = 0$, and the columns of $U_1$ span the column space of $A$ (in $R^m$).]
$$A = U \Sigma V^T = [U_1\; U_2] \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T, \qquad AV = [A V_1 \;\; A V_2] = U \Sigma$$
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U\Sigma$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$:
    $A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma$
    • the $i$-th entry of a row of $U\Sigma$ is related to the projection of a corresponding row of $A$ onto the $i$-th column of $V$
  – Each row of $V\Sigma$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$:
    $A = U\Sigma V^T \;\Rightarrow\; A^T U = V\Sigma U^T U = V\Sigma$
    • the $i$-th entry of a row of $V\Sigma$ is related to the projection of a corresponding row of $A^T$ onto the $i$-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix $A$ ($m \times n$, rows $w_1, \dots, w_m$, columns $d_1, \dots, d_n$):
    • compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
    • compare two docs → dot product of two columns of $A$ (or an entry in $A^T A$)
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix $A'$:
    • compare two terms → dot product of two rows of $U'\Sigma'$
    • compare two docs → dot product of two rows of $V'\Sigma'$
    • compare a query word and a doc → each individual entry of $A'$
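These comparisons can be read directly off the truncated factors. A minimal MATLAB/Octave sketch (the word-document matrix A and the reduced dimension k are assumed to be given; the variable names are illustrative):

[U, S, V] = svd(A, 'econ');
Uk = U(:, 1:k); Sk = S(1:k, 1:k); Vk = V(:, 1:k);   % the truncated factors U', Sigma', V'

T = Uk * Sk;                      % term representations: row i stands for word w_i
D = Vk * Sk;                      % doc  representations: row j stands for doc d_j

term_sim = T(1, :) * T(2, :)';    % compare two terms: dot product of two rows of U'*Sigma'
doc_sim  = D(1, :) * D(2, :)';    % compare two docs:  dot product of two rows of V'*Sigma'
Aprime   = Uk * Sk * Vk';         % compare a term and a doc: the individual entry Aprime(i, j)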
bull Formulation for discriminative feature extraction ndash Model-free (nonparametric)
bull Without prior information eg PCA bull With prior information eg LDA
ndash Model-dependent (parametric)bull Eg PLSA (Probabilistic Latent Semantic Analysis) with EM
(Expectation-Maximization) MCE (Minimum Classification Error) Training
ML-5
Principal Component Analysis (PCA) (12)
bull Known as Karhunen-Loẻve Transform (1947 1963)
ndash Or Hotelling Transform (1933)
bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples
accurately
bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo
Pearson 1901
ML-6
Principal Component Analysis (PCA) (22)
The patterns show a significant differencefrom each other in one of the transformed axes
ML-7
PCA Derivations (113)
bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean
before processing the following analysis
ndash x can be represented without error by the summation of n linearly independent vectors
sum ===
n
iiiiy Φyφx [ ]Tni yyy where 1=y
[ ]ni φφφΦ 1=
0xμ x == E
The basis vectorsThe i-th component
in the feature (mapped) space
ML-8
PCA Derivations (213)
(23)
(01)
(10)
(11)(-11)
(52rsquo12rsquo)
⎥⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡minus+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡11
121
11 1
21
11
25
10
301
232
orthogonal basis sets
ML-9
PCA Derivations (313)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull Such that is equal to the projectionof on
⎩⎨⎧
ne=
=jiji
jTi if 0
if 1φφ
iφxxx T
iiT
ii y ϕϕ ==forall
iy
x1ϕ
2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Φ
ML-10
PCA Derivations (413)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull also has the following propertiesndash Its mean is zero too
ndash Its variance is
bull The correlation between two projections and is
Φiy
0==== 0xx Ti
Ti
Tii EEyE ϕϕϕ jy
( )( ) j
Tij
TTi
jTT
iTT
jTiji
E
EEyyE
Rφφφxxφ
φxxφxφxφ
==
==
[ ] [ ]xRR
xxxx
ofmatrix relation (auto-)cor theis
2222
iTi
iTT
iiTT
i EEyEyEyEii
i
ϕϕ
ϕϕϕϕσ
=
===minus=
( )( )
sum==
minus⎟⎠⎞
⎜⎝⎛
sumasymp
minusminus=
=
i
Tii
T
TN
i
Tii
T
NE
N
E
xxxxR
μμxx
μxμxΣ
1
1
][
1 0
iy jy
ML-11
PCA Derivations (513)
bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can
approximate well in mean-squared error criterionxsiφ
sumsumsum+===
+==n
mjjj
m
ii
n
ii yyy
111φφφx ii
( ) sum=
=m
iiym
1ˆ iφx
( ) ( )
sumsum
sum
sum sum
sumsum
+=+=
+=
+= +=
+=+=
==
=
⎭⎬⎫
⎩⎨⎧=
⎭⎬⎫
⎩⎨⎧
⎟⎠⎞⎜
⎝⎛⎟⎠⎞
⎜⎝⎛=minus=
n
mjj
Tj
n
mjj
n
mjj
kTj
n
mj
n
mkkj
kn
mkk
Tj
n
mjj
yE
yyE
yyEmEm
11
2
1
2
1 1
11
2
ˆ
R φφ
φφ
φφxx
σ
ε
⎩⎨⎧
ne=
=kjkj
kTj if 0
if 1φφQ
[ ] 2
222
0
j
j
yE
yEyE
yE
jj
j
=
minus=
=
σ
We should discard the
bases where the projections have lower variances
original vector
reconstructed vector
ML-12
PCA Derivations (613)
bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the
eigenvectors of the correlation matrix associated with eigenvalues
bull They will have the property that
ndash Such that the mean-squared error mentioned above will be
siφR
siλ
jjj φR φ λ=
( )
sumsumsum
sum
+=+=+=
+=
===
=
n
mjj
n
mjjj
Tj
n
mjj
Tj
n
mjjm
111
1
2
λλ
σε
φφR φφ
is real and symmetric therefore its eigenvectors
form a orthonormal setR
is positive definite ( )=gt all eigenvalues are positive
R 0gtRxxT
ML-13
PCA Derivations (713)
bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest
eigenvalues the mean-squared error will be
ndash Any two projections and will be mutually uncorrelated
bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Known as Karhunen-Loẻve Transform (1947 1963)
ndash Or Hotelling Transform (1933)
bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples
accurately
bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo
Pearson 1901
ML-6
Principal Component Analysis (PCA) (22)
The patterns show a significant differencefrom each other in one of the transformed axes
ML-7
PCA Derivations (113)
bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean
before processing the following analysis
ndash x can be represented without error by the summation of n linearly independent vectors
sum ===
n
iiiiy Φyφx [ ]Tni yyy where 1=y
[ ]ni φφφΦ 1=
0xμ x == E
The basis vectorsThe i-th component
in the feature (mapped) space
ML-8
PCA Derivations (213)
(23)
(01)
(10)
(11)(-11)
(52rsquo12rsquo)
⎥⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡minus+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡+⎥
⎦
⎤⎢⎣
⎡=⎥
⎦
⎤⎢⎣
⎡11
121
11 1
21
11
25
10
301
232
orthogonal basis sets
ML-9
PCA Derivations (313)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull Such that is equal to the projectionof on
⎩⎨⎧
ne=
=jiji
jTi if 0
if 1φφ
iφxxx T
iiT
ii y ϕϕ ==forall
iy
x1ϕ
2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Φ
ML-10
PCA Derivations (413)
ndash Further assume the column (basis) vectors of the matrix form an orthonormal set
bull also has the following propertiesndash Its mean is zero too
ndash Its variance is
bull The correlation between two projections and is
Φiy
0==== 0xx Ti
Ti
Tii EEyE ϕϕϕ jy
( )( ) j
Tij
TTi
jTT
iTT
jTiji
E
EEyyE
Rφφφxxφ
φxxφxφxφ
==
==
[ ] [ ]xRR
xxxx
ofmatrix relation (auto-)cor theis
2222
iTi
iTT
iiTT
i EEyEyEyEii
i
ϕϕ
ϕϕϕϕσ
=
===minus=
( )( )
sum==
minus⎟⎠⎞
⎜⎝⎛
sumasymp
minusminus=
=
i
Tii
T
TN
i
Tii
T
NE
N
E
xxxxR
μμxx
μxμxΣ
1
1
][
1 0
iy jy
ML-11
PCA Derivations (513)
bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can
approximate well in mean-squared error criterionxsiφ
sumsumsum+===
+==n
mjjj
m
ii
n
ii yyy
111φφφx ii
( ) sum=
=m
iiym
1ˆ iφx
( ) ( )
sumsum
sum
sum sum
sumsum
+=+=
+=
+= +=
+=+=
==
=
⎭⎬⎫
⎩⎨⎧=
⎭⎬⎫
⎩⎨⎧
⎟⎠⎞⎜
⎝⎛⎟⎠⎞
⎜⎝⎛=minus=
n
mjj
Tj
n
mjj
n
mjj
kTj
n
mj
n
mkkj
kn
mkk
Tj
n
mjj
yE
yyE
yyEmEm
11
2
1
2
1 1
11
2
ˆ
R φφ
φφ
φφxx
σ
ε
⎩⎨⎧
ne=
=kjkj
kTj if 0
if 1φφQ
[ ] 2
222
0
j
j
yE
yEyE
yE
jj
j
=
minus=
=
σ
We should discard the
bases where the projections have lower variances
original vector
reconstructed vector
ML-12
PCA Derivations (613)
bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the
eigenvectors of the correlation matrix associated with eigenvalues
bull They will have the property that
ndash Such that the mean-squared error mentioned above will be
siφR
siλ
jjj φR φ λ=
( )
sumsumsum
sum
+=+=+=
+=
===
=
n
mjj
n
mjjj
Tj
n
mjj
Tj
n
mjjm
111
1
2
λλ
σε
φφR φφ
is real and symmetric therefore its eigenvectors
form a orthonormal setR
is positive definite ( )=gt all eigenvalues are positive
R 0gtRxxT
ML-13
PCA Derivations (713)
bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest
eigenvalues the mean-squared error will be
ndash Any two projections and will be mutually uncorrelated
bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (3/7)
[Diagram: $A_{m\times n} = U\Sigma V^T = (U_1\ U_2)\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1\Sigma_1V_1^T$; hence $AV = U\Sigma$, $AV_1 = U_1\Sigma_1$ and $AV_2 = 0$; the $v_i$ span the row space of A (in $R^n$) and the $u_i$ span the column space of A (in $R^m$)]
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
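The three comparisons can be written directly in terms of the truncated factors; a short sketch, reusing Uk, Sk and Vk from the SVD example earlier:

  T = Uk * Sk;                   % row i: coordinates of term w_i
  D = Vk * Sk;                   % row j: coordinates of doc  d_j
  termSim = T(1,:) * T(2,:)';    % compare terms w1, w2: dot product of two rows of U'*Sigma'
  docSim  = D(1,:) * D(2,:)';    % compare docs  d1, d2: dot product of two rows of V'*Sigma'
  Ak      = Uk * Sk * Vk';
  wordDoc = Ak(1,2);             % compare term w1 with doc d2: an entry of A'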
• A two-dimensional example of Principal Component Analysis
ML-15
PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that the eigenvector solution is the optimal one under the mean-squared error criterion ($\varepsilon_{\text{eigen}}(m)$)
  – Objective function (to be minimized, subject to the orthonormality constraints $\varphi_k^T\varphi_j = \delta_{jk}$, where $\delta_{jk} = 1$ if $j = k$ and $0$ if $j \ne k$):
    $J = \sum_{j=m+1}^{n}\varphi_j^TR\varphi_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n}\mu_{kj}\left(\varphi_k^T\varphi_j - \delta_{jk}\right)$
  – Take the derivative (using $\partial(\varphi^TR\varphi)/\partial\varphi = 2R\varphi$) and set it to zero:
    $\dfrac{\partial J}{\partial\varphi_j} = 2R\varphi_j - 2\sum_{k=m+1}^{n}\mu_{kj}\varphi_k = 0,\ \forall\, m+1 \le j \le n$
    $\Rightarrow R\,\Phi_{n-m} = \Phi_{n-m}U_{n-m}$, where $\Phi_{n-m} = [\varphi_{m+1}\ \dots\ \varphi_n]$ and $U_{n-m} = [\mu_{kj}]$
  – This has a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \dots, \lambda_n$ of $R$, and $\varphi_{m+1}, \dots, \varphi_n$ are their corresponding eigenvectors
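The minimum-MSE property can be checked numerically; a sketch with simulated zero-mean data (all numbers are arbitrary) in which the residual mean-squared error after keeping the m leading eigenvectors should come out close to the sum of the discarded eigenvalues:

  n = 5; m = 2; N = 20000;
  X = randn(N, n) * diag([3 2 1 0.5 0.2]);   % zero-mean samples, one row per sample
  R = (X' * X) / N;                          % correlation matrix of x
  [Phi, Lambda] = eig(R);
  [lam, idx] = sort(diag(Lambda), 'descend');
  Phi  = Phi(:, idx);
  Xhat = X * Phi(:,1:m) * Phi(:,1:m)';       % keep only the m leading components
  mse  = mean(sum((X - Xhat).^2, 2));        % compare with sum(lam(m+1:end))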
ML-16
PCA Derivations (10/13)
• Given an input vector x with dimension n
  – Try to construct a linear transform Φ′ (an n×m matrix, m < n) such that the truncation result $\Phi'^Tx$ is optimal under the mean-squared error criterion
[Diagram: Encoder $y = \Phi'^Tx$, Decoder $\hat{x} = \Phi'y$, where $x = [x_1\ x_2\ \dots\ x_n]^T$, $y = [y_1\ y_2\ \dots\ y_m]^T$, $\hat{x} = [\hat{x}_1\ \hat{x}_2\ \dots\ \hat{x}_n]^T$ and $\Phi' = [e_1\ e_2\ \dots\ e_m]$; minimize $E_x\left[(x-\hat{x})^T(x-\hat{x})\right]$]
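A compact sketch of the encoder/decoder pair above, reusing Phi and m from the previous snippet (so Φ′ holds the m leading eigenvectors):

  PhiP = Phi(:, 1:m);            % n x m truncated transform, columns e_1..e_m
  x    = X(1, :)';               % one input vector (n x 1)
  y    = PhiP' * x;              % encoder: m x 1
  xhat = PhiP  * y;              % decoder: n x 1 reconstruction
  errx = (x - xhat)' * (x - xhat);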
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
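Both selection rules are one-liners once the eigenvalues are sorted; a sketch continuing the simulated example above:

  ratio = cumsum(lam) / sum(lam);          % cumulative proportion of variance
  m1 = find(ratio >= 0.95, 1);             % smallest m reaching a 95% threshold
  m2 = sum(lam > mean(lam));               % keep eigenvalues above the average eigenvalue
  plot(lam, '-o'); xlabel('number of eigenvectors kept'); ylabel('eigenvalue');  % scree graph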
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
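The identity S = Sw + Sb is easy to confirm numerically; a sketch with two made-up Gaussian classes:

  X1 = randn(300, 3) + 2;  X2 = randn(200, 3) - 1;
  X  = [X1; X2];  g = [ones(300,1); 2*ones(200,1)];  N = size(X,1);
  xbar = mean(X);
  Xc = X - repmat(xbar, N, 1);
  S  = (Xc' * Xc) / N;
  Sw = zeros(3); Sb = zeros(3);
  for j = 1:2
    Xj = X(g==j,:);  Nj = size(Xj,1);  xj = mean(Xj);
    Sw = Sw + (Nj/N) * ((Xj - repmat(xj,Nj,1))' * (Xj - repmat(xj,Nj,1)) / Nj);
    Sb = Sb + (Nj/N) * ((xj - xbar)' * (xj - xbar));
  end
  maxDiff = max(max(abs(S - (Sw + Sb))));   % ~1e-15, i.e. S = Sw + Sb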
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions, the new feature dimensions, and the threshold for the information content preserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256×256 image divided into 8×8 blocks]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.) (value reduction) (feature reduction)
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps:
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel:
      $x_1 = [x_{11}\ x_{21}\ \dots\ x_{n1}]^T,\ x_2 = [x_{12}\ x_{22}\ \dots\ x_{n2}]^T,\ \dots,\ x_L = [x_{1L}\ x_{2L}\ \dots\ x_{nL}]^T$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA): find the eigenvectors of the matrix, the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0"; the other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps:
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces:
      $\hat{x}_i = \bar{x} + w_{i1}e_1 + w_{i2}e_2 + \dots + w_{iK}e_K \Rightarrow y_i = [w_{i1}\ w_{i2}\ \dots\ w_{iK}]$ (the feature vector of a person i)
  – The Eigenface approach follows the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
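A toy MATLAB sketch of the eigenface pipeline above; the image matrix is random stand-in data, and the L×L "trick" (taking eigenvectors of Xc'*Xc rather than of the huge pixel covariance) follows the usual Turk-Pentland shortcut:

  n = 64*64; L = 40;
  Imgs  = rand(n, L);                         % stand-in for L vectorized face images
  meanF = mean(Imgs, 2);                      % "eigenface 0": the averaged face
  Xc    = Imgs - repmat(meanF, 1, L);
  [V, D] = eig(Xc' * Xc);                     % L x L problem instead of n x n
  [dSorted, idx] = sort(diag(D), 'descend');
  E = Xc * V(:, idx(1:7));                    % seven eigenfaces (n x 7)
  E = E ./ repmat(sqrt(sum(E.^2)), n, 1);     % normalize each eigenface to unit length
  w = E' * (Imgs(:,1) - meanF);               % K=7 weights representing face 1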
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps:
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector) of the SD HMM model mean parameters (μ), with dimension D = (M·n)×1
[Diagram: data from Speaker 1 … Speaker R are used for model training of Speaker 1 … Speaker R HMMs (together with an SI HMM); Principal Component Analysis on the supervectors performs the eigenvoice space construction; each new speaker S is represented by a point P in K-space:
  $P = e_0 + w_1(i)e_1 + w_2(i)e_2 + \dots + w_K(i)e_K$, where $e_0$ is derived from the SI HMM model]
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1): correlates with pitch or sex
  – Dimension 2 (eigenvoice 2): correlates with amplitude
  – Dimension 3 (eigenvoice 3): correlates with second-formant movement
Note that Eigenface operates in feature space, while Eigenvoice operates in model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called:
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance, shown in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of them belonging to one of the J classes, with class index $g(x_i) \in \{1, 2, \dots, J\}$
  – The sample mean is $\bar{x} = \dfrac{1}{N}\sum_{i=1}^{N}x_i$
  – The class sample means are $\bar{x}_j = \dfrac{1}{N_j}\sum_{g(x_i)=j}x_i$
  – The class sample covariances are $\Sigma_j = \dfrac{1}{N_j}\sum_{g(x_i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^T$
  – The average within-class variation before the transform: $S_w = \dfrac{1}{N}\sum_j N_j\Sigma_j$
  – The average between-class variation before the transform: $S_b = \dfrac{1}{N}\sum_j N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \dots\ w_m]$ is applied:
  – The sample vectors will be $y_i = W^Tx_i$
  – The sample mean will be $\bar{y} = \dfrac{1}{N}\sum_{i=1}^{N}y_i = W^T\left(\dfrac{1}{N}\sum_{i=1}^{N}x_i\right) = W^T\bar{x}$
  – The class sample means will be $\bar{y}_j = \dfrac{1}{N_j}\sum_{g(x_i)=j}W^Tx_i = W^T\bar{x}_j$
  – The average within-class variation will be
    $\tilde{S}_w = \dfrac{1}{N}\sum_j N_j\cdot\dfrac{1}{N_j}\sum_{g(x_i)=j}\left(W^Tx_i - W^T\bar{x}_j\right)\left(W^Tx_i - W^T\bar{x}_j\right)^T = \dfrac{1}{N}\sum_j N_j\,W^T\Sigma_jW = W^TS_wW$
  (under the transform W: $x \to y$, $S_w \to \tilde{S}_w$, $S_b \to \tilde{S}_b$)
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
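A sketch of the closed-form solution, reusing Sw, Sb and the labeled data X from the S = Sw + Sb check earlier; eig(Sb, Sw) solves the generalized problem Sb*w = lambda*Sw*w directly:

  [Wfull, Dlda] = eig(Sb, Sw);                % generalized eigenvectors
  [vals, idx] = sort(diag(Dlda), 'descend');
  W = Wfull(:, idx(1));                       % with J=2 classes only J-1=1 direction is informative
  Y = X * W;                                  % transformed samples y_i = W^T x_i (one per row)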
ML-36
LDA Derivations (4/4)
• Proof:
  $\hat{W} = \arg\max_W J(W) = \arg\max_W \dfrac{|W^TS_bW|}{|W^TS_wW|}$  (| · | denotes the determinant)
  Or, for each column vector $w_i$ of W, we want to find the $w_i$ that maximizes the quadratic form $\dfrac{w_i^TS_bw_i}{w_i^TS_ww_i}$
  Setting the derivative to zero (using $\dfrac{d}{dx}(x^TCx) = (C+C^T)x$ and the quotient rule $\left(\dfrac{F}{G}\right)' = \dfrac{F'G - FG'}{G^2}$):
  $\dfrac{\partial}{\partial w_i}\dfrac{w_i^TS_bw_i}{w_i^TS_ww_i} = \dfrac{2S_bw_i\,(w_i^TS_ww_i) - (w_i^TS_bw_i)\,2S_ww_i}{(w_i^TS_ww_i)^2} = 0$
  $\Rightarrow S_bw_i - \left(\dfrac{w_i^TS_bw_i}{w_i^TS_ww_i}\right)S_ww_i = 0 \Rightarrow S_bw_i - \lambda_iS_ww_i = 0$, where $\lambda_i = \dfrac{w_i^TS_bw_i}{w_i^TS_ww_i}$
  $\Rightarrow S_bw_i = \lambda_iS_ww_i \Rightarrow S_w^{-1}S_bw_i = \lambda_iw_i$
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, $\Sigma = \frac{1}{N}\sum_i(x_i-\bar{x})(x_i-\bar{x})^T$, and covariance matrix of the 18-cepstral vectors (after the cosine transform), $\Sigma' = \frac{1}{N}\sum_i(y_i-\bar{y})(y_i-\bar{y})^T$; both calculated using Year-99's 5471 files]
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs. LDA (1/2)
[Figure: side-by-side comparison of the projection directions chosen by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply the PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the two transformations, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (3/4)
• If the transform W = [w_1, w_2, \ldots, w_m] is applied:
– Similarly, the average between-class variation will be
\tilde{S}_b = W^T S_b W
– Try to find the optimal W such that the following objective function is maximized:
J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}
• A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
S_b w_i = \lambda_i S_w w_i
• That is, the w_i's are the eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_b:
S_w^{-1} S_b w_i = \lambda_i w_i
ML-36
LDA Derivations (4/4)
• Proof:
\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|}
Or, for each column vector w_i of W, we want to find \hat{w}_i that maximizes
J(w_i) = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}
The quadratic form has an optimal solution where the derivative vanishes:
\frac{\partial}{\partial w_i} \left( \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right) = \frac{2 S_b w_i (w_i^T S_w w_i) - (w_i^T S_b w_i) \, 2 S_w w_i}{(w_i^T S_w w_i)^2} = 0
\Rightarrow S_b w_i - \frac{w_i^T S_b w_i}{w_i^T S_w w_i} S_w w_i = 0
\Rightarrow S_b w_i - \lambda_i S_w w_i = 0, \quad where \ \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}
\Rightarrow S_b w_i = \lambda_i S_w w_i \Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i
(using the quotient rule (F/G)' = (F'G - FG')/G^2 and \frac{d}{dx}(x^T C x) = (C + C^T)x; |\cdot| denotes the determinant)
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
Covariance matrix of the 18-Mel-filter-bank vectors (calculated using Year-99's 5471 files)
Covariance matrix of the 18-cepstral vectors, after the cosine transform (calculated using Year-99's 5471 files)
\Sigma = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T, \quad \Sigma' = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
Covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform) and of the 18-LDA-cepstral vectors (after the LDA transform); both calculated using Year-99's 5471 files

Character Error Rate:
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
– Clearly, HDA provides a much lower classification error than LDA, theoretically
• However, most statistical modeling assumes data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply the PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the two transformed data sets, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot covariance matrix
• Eigen decomposition

% Note: decimal points were lost in extraction; the matrix values below are restored by assumption.
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default'); surf(CoVar);               % plot the covariance matrix as a surface

BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class scatter
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class scatter

% LDA:
IWI = inv(WI); A = IWI * BE;
% PCA: A = BE + WI;   why? (prove it)
[V, D] = eig(A);                                % or: [V, D] = eigs(A, 3)

fid = fopen('Basis', 'w');
for i = 1:3                                     % feature vector length
    for j = 1:3                                 % basis number
        fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples:
[Scatter plot: distribution of the 2000 original samples after the PCA transform (feature 1 vs. feature 2)]
[Scatter plot: distribution of the 2000 original samples after the LDA transform (feature 1 vs. feature 2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
LSA (2/7)
• Dimension Reduction and Feature Extraction
– PCA: project onto an orthonormal basis \varphi_1, \ldots, \varphi_k of the feature space
y_i = \varphi_i^T X, \quad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i
\min \| X - \hat{X} \|^2 \ for a given k
– SVD (in LSA): for the m \times n matrix A, keep a k-dimensional latent semantic space
A \approx A' = U' \Sigma' V'^T, \quad U': m \times k, \ \Sigma': k \times k, \ V'^T: k \times n, \quad k \le r \le \min(m, n)
\min \| A - A' \|_F^2 \ for a given k
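The rank-k approximation A' referred to above can be computed with a plain SVD; a minimal MATLAB sketch (A and k assumed given):

[U, S, V] = svd(A, 'econ');                      % A = U*S*V'
Aprime = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % best rank-k approximation of A
err = norm(A - Aprime, 'fro')^2;                 % the minimized Frobenius-norm error for this k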
ML-47
LSA (3/7)
– Singular Value Decomposition (SVD) is used for the word-document matrix
• A least-squares method for dimension reduction
Projection of a vector x onto a unit basis vector \varphi_1:
y_1 = \varphi_1^T x = \frac{x^T \varphi_1}{\| \varphi_1 \|} = \| x \| \cos\theta_1, \quad where \ \| \varphi_1 \| = 1
ML-48
LSA (4/7)
• Frameworks to circumvent the vocabulary mismatch:
– literal term matching: Query terms vs. Doc terms (with query expansion and doc expansion)
– latent semantic structure retrieval: Query and Doc are both mapped through a structure model
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD) of the m \times n word-document matrix A (rows: words w_1, \ldots, w_m; columns: docs d_1, \ldots, d_n):
A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n}, \quad r \le \min(m, n)
A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^T_{k \times n}, \quad k \le r
• Both U and V have orthonormal column vectors: U^T U = I_{r \times r}, \ V^T V = I_{r \times r}
• \| A \|_F^2 \ge \| A' \|_F^2, \quad where \ \| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2
• Row space of A \subset R^n; column space of A \subset R^m
• Docs and queries are represented in a k-dimensional space; the quantities on the axes can be properly weighted according to the associated diagonal values of \Sigma_k
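For example, documents and an incoming query can be placed in this k-dimensional space as follows (a MATLAB sketch; the query folding-in formula q^T U_k \Sigma_k^{-1} is the standard LSI recipe and is an assumption here, not stated on the slide; A, k, and the query term-count vector q are assumed inputs):

[Uk, Sk, Vk] = svds(A, k);                   % rank-k SVD of the m-by-n word-document matrix
docs = Vk * Sk;                              % each row: one document d_j in k-space
qk = (q' * Uk) / Sk;                         % fold the query into the same k-space
sims = (docs * qk') ./ (sqrt(sum(docs.^2, 2)) * norm(qk));   % cosine similarity, query vs. each doc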
ML-52
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
– A^T A is a symmetric n \times n matrix
• All eigenvalues \lambda_j are nonnegative real numbers: \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0
• All eigenvectors v_j are orthonormal (\in R^n): v_j^T v_j = 1, \ V_{n \times n} = [v_1 \, v_2 \cdots v_n], \ V^T V = I_{n \times n}
• Define the singular values \sigma_j = \sqrt{\lambda_j}, \ j = 1, \ldots, n
– as the square roots of the eigenvalues of A^T A
– as the lengths of the vectors A v_1, A v_2, \ldots, A v_n:
\sigma_1 = \| A v_1 \|, \ \sigma_2 = \| A v_2 \|, \ldots, \quad since \ \| A v_i \|^2 = v_i^T A^T A v_i = \lambda_i \Rightarrow \| A v_i \| = \sigma_i
For \lambda_i \ne 0, \ i = 1, \ldots, r: \{A v_1, A v_2, \ldots, A v_r\} is an orthogonal basis of Col A
ML-53
LSA Derivations (2/7)
• \{A v_1, A v_2, \ldots, A v_r\} is an orthogonal basis of Col A:
(A v_i) \cdot (A v_j) = v_i^T A^T A v_j = \lambda_j \, v_i^T v_j = 0 \quad (i \ne j)
– Suppose that A (or A^T A) has rank r \le n:
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \quad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0
– Define an orthonormal basis \{u_1, u_2, \ldots, u_r\} for Col A:
u_i = \frac{1}{\| A v_i \|} A v_i = \frac{1}{\sigma_i} A v_i \ \Rightarrow \ \sigma_i u_i = A v_i \ \Rightarrow \ [u_1 \, u_2 \cdots u_r] \, \Sigma = A \, [v_1 \, v_2 \cdots v_r]
(V is an orthonormal n \times r matrix known in advance from the eigenvectors of A^T A; U is also an orthonormal m \times r matrix)
• Extend to an orthonormal basis \{u_1, u_2, \ldots, u_m\} of R^m:
[u_1 \cdots u_m] \, \Sigma = A \, [v_1 \cdots v_n] \ \Rightarrow \ U \Sigma = A V \ \Rightarrow \ U \Sigma V^T = A V V^T \ \Rightarrow \ A = U \Sigma V^T
where \ \Sigma = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}
\| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2
ML-54
LSA Derivations (3/7)
A_{m \times n} = (U_1 \; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T, \qquad U \Sigma = A V
– The rows v_i^T (the columns of V_1, in R^n) span the row space of A; the columns u_i of U_1 (in R^m) span the column space of A; A V_2 = 0
ML-55
LSA Derivations (4/7)
• Additional explanations
– Each row of U is related to the projection of the corresponding row of A onto the basis formed by the columns of V:
A = U \Sigma V^T \ \Rightarrow \ A V = U \Sigma V^T V = U \Sigma \ \Rightarrow \ U \Sigma = A V
• the i-th entry of a row of U is related to the projection of the corresponding row of A onto the i-th column of V
– Each row of V is related to the projection of the corresponding row of A^T onto the basis formed by the columns of U:
A = U \Sigma V^T \ \Rightarrow \ A^T U = V \Sigma^T U^T U = V \Sigma^T = V \Sigma \ \Rightarrow \ V \Sigma = A^T U
• the i-th entry of a row of V is related to the projection of the corresponding row of A^T onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix A (m \times n; rows w_1, \ldots, w_m; columns d_1, \ldots, d_n):
• compare two terms → dot product of two rows of A (or an entry in A A^T)
• compare two docs → dot product of two columns of A (or an entry in A^T A)
• compare a term and a doc → each individual entry of A
– The new word-document matrix A':
• compare two terms → dot product of two rows of U' \Sigma'
• compare two docs → dot product of two rows of V' \Sigma'
• compare a query word and a doc → each individual entry of A'
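These comparison rules map one-to-one onto the SVD factors; a brief MATLAB sketch (the indices i, j, p, q are illustrative):

[Uk, Sk, Vk] = svds(A, k);
T  = Uk * Sk;                                % term representations: rows of U'*Sigma'
Dk = Vk * Sk;                                % doc representations: rows of V'*Sigma'
term_sim = T(i, :) * T(j, :)';               % compare terms w_i and w_j
doc_sim  = Dk(p, :) * Dk(q, :)';             % compare docs d_p and d_q
Aprime   = Uk * Sk * Vk';                    % compare word w_i and doc d_p via Aprime(i, p)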
• A two-dimensional example of Principal Component Analysis
ML-15
PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
– It can be proved that choosing eigenvectors of R is the optimal solution under the mean-squared error criterion (\varepsilon_m(eigen))
Objective function (to be minimized, with Lagrange multipliers \mu_j for the constraints \varphi_j^T \varphi_k = \delta_{jk}):
J = \sum_{j=m+1}^{n} \varphi_j^T R \varphi_j - \sum_{j=m+1}^{n} \mu_j \left( \varphi_j^T \varphi_j - 1 \right)
Take the derivative (using \frac{\partial}{\partial \varphi} (\varphi^T R \varphi) = 2 R \varphi):
\frac{\partial J}{\partial \varphi_j} = 2 \left( R \varphi_j - \mu_j \varphi_j \right) = 0, \quad \forall j, \ m+1 \le j \le n
\Rightarrow R \varphi_j = \mu_j \varphi_j \ \Rightarrow \ R \Phi_{n-m} = \Phi_{n-m} U_{n-m}, \quad where \ \Phi_{n-m} = [\varphi_{m+1} \cdots \varphi_n]
This has a particular solution if U_{n-m} is a diagonal matrix whose diagonal elements are the eigenvalues \lambda_{m+1}, \ldots, \lambda_n of R and \varphi_{m+1}, \ldots, \varphi_n are their corresponding eigenvectors
(\delta_{jk} = 1 \ if \ j = k, \ 0 \ if \ j \ne k)
ML-16
PCA Derivations (10/13)
• Given an input vector x of dimension n
– Try to construct a linear transform \Phi' (\Phi' is an n \times m matrix, m < n) such that the truncation result \Phi'^T x is optimal under the mean-squared error criterion:
minimize \ E_x \left\{ (x - \hat{x})^T (x - \hat{x}) \right\}
Encoder: y = \Phi'^T x, \quad x = [x_1, x_2, \ldots, x_n]^T, \ y = [y_1, y_2, \ldots, y_m]^T
Decoder: \hat{x} = \Phi' y, \quad \hat{x} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n]^T
where \ \Phi' = [e_1 \, e_2 \cdots e_m]
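The encoder/decoder pair amounts to two matrix products; a minimal MATLAB sketch (Phi_p holds the m retained orthonormal eigenvectors as columns, x is one n-dimensional input; names are illustrative):

y    = Phi_p' * x;                           % encoder: the m-dimensional code Phi'^T x
xhat = Phi_p * y;                            % decoder: reconstruction back in R^n
err  = (x - xhat)' * (x - xhat);             % squared reconstruction error minimized in expectation over x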
ML-17
PCA Derivations (11/13)
• Data compression in communication
– PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
– PCA needs no prior information (e.g., class distributions or output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
– The plot of variance as a function of the number of eigenvectors kept
• Select m such that
\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge Threshold
• Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue):
\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i
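Either selection rule is a one-liner on the sorted eigenvalue spectrum; a MATLAB sketch (R is the input covariance matrix; the threshold value is an assumed example):

lam = sort(eig(R), 'descend');               % eigenvalues, largest first
Threshold = 0.90;                            % e.g., keep 90% of the total variance
m1 = find(cumsum(lam) / sum(lam) >= Threshold, 1, 'first');   % smallest m meeting the ratio test
m2 = sum(lam > mean(lam));                   % count of eigenvalues above the average eigenvalue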
ML-19
PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of the average between-class variation and the average within-class variation is maximal:
J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W
\tilde{S}_w = W^T S_w W, \quad \tilde{S}_b = W^T S_b W
S_w = \frac{1}{N} \sum_j N_j \Sigma_j \quad (j: class index)
S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T
S = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T \quad (i: sample index)
Try to show that S = S_w + S_b
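The identity S = S_w + S_b is also easy to check numerically with the quantities computed in the earlier LDA sketch (X, Sw, Sb as defined there):

Xc = X - repmat(mean(X, 2), 1, size(X, 2));
S  = Xc * Xc' / size(X, 2);                  % total covariance S
residual = norm(S - (Sw + Sb), 'fro');       % should be zero up to rounding error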
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
Threshold for the amount of information content to be preserved; new feature dimensions; correlation matrix for the old feature dimensions
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256 × 256 image divided into 8 × 8 blocks]
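A rough MATLAB sketch of the block-based PCA coding suggested by the figure (assumes the Image Processing Toolbox for im2col/col2im; I, a 256-by-256 grayscale image in double precision, and m, the number of retained components, are assumed inputs):

B = im2col(I, [8 8], 'distinct');            % 64-by-1024 matrix: each column is one 8x8 block
mu = mean(B, 2);
Bc = B - repmat(mu, 1, size(B, 2));
[E, D] = eig(Bc * Bc' / size(B, 2));         % PCA on the 64-dimensional block vectors
[~, idx] = sort(diag(D), 'descend');
E = E(:, idx(1:m));                          % keep m << 64 basis vectors (feature reduction)
codes = E' * Bc;                             % compressed representation of every block
Ihat = col2im(E * codes + repmat(mu, 1, size(B, 2)), [8 8], [256 256], 'distinct');   % decoded image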
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.) (feature reduction vs. value reduction)
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
– Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
– Steps:
• Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel:
x_1 = [x_{11}, x_{12}, \ldots, x_{1n}]^T, \ x_2 = [x_{21}, x_{22}, \ldots, x_{2n}]^T, \ \ldots, \ x_L = [x_{L1}, x_{L2}, \ldots, x_{Ln}]^T
• Calculate the covariance/correlation matrix between these reference vectors
• Apply Principal Component Analysis: find the eigenvectors of the matrix, i.e., the eigenfaces
• Besides, the vector obtained by averaging all images is called "eigenface 0"; the other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
– Steps:
• The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
– The eigenface approach preserves the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull An two-dimensional example of Principle Component Analysis
ML-15
PCA Derivations (913)
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (11/13)

• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  – PCA needs no prior information (e.g., class distributions or other output information) about the sample patterns
ML-18
PCA Derivations (12/13)

• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
  – Select m such that

      (\lambda_1 + \lambda_2 + ... + \lambda_m) / (\lambda_1 + \lambda_2 + ... + \lambda_m + ... + \lambda_n) \ge Threshold,   m < n

  – Or select those eigenvectors whose eigenvalues are larger than the average input variance (the average eigenvalue):

      \lambda_m \ge (1/n) \sum_{i=1}^{n} \lambda_i
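A sketch of both selection rules; the eigenvalues and the threshold below are illustrative.

lam = [4.1 2.3 0.9 0.4 0.2 0.1];                 % example eigenvalues, sorted descending
Threshold = 0.9;                                  % kept-variance ratio
ratio = cumsum(lam) / sum(lam);                   % cumulative variance ratio vs. m
m1 = find(ratio >= Threshold, 1, 'first');        % rule 1: smallest m over the threshold
m2 = sum(lam > mean(lam));                        % rule 2: eigenvalues above the average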
ML-19
PCA Derivations (13/13)

• PCA finds a linear transform W such that the sum of the average between-class variation and the average within-class variation is maximal:

      J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W

  where (i: sample index, j: class index)

      \tilde{S}_w = W^T S_w W,    \tilde{S}_b = W^T S_b W
      S_w = (1/N) \sum_j N_j \Sigma_j
      S_b = (1/N) \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T
      S   = (1/N) \sum_i (x_i - \bar{x})(x_i - \bar{x})^T

  – Try to show that S = S_w + S_b
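A quick numerical check of S = S_w + S_b, sketched with toy two-class data (sizes are illustrative).

X1 = randn(100, 3) + 1;  X2 = randn(200, 3) - 1;          % two labeled classes
X  = [X1; X2];  N = size(X, 1);  xbar = mean(X, 1);
Xc = X - ones(N, 1) * xbar;
S  = Xc' * Xc / N;                                         % total scatter
Sw = (size(X1,1)*cov(X1,1) + size(X2,1)*cov(X2,1)) / N;    % average within-class scatter
d1 = mean(X1,1) - xbar;  d2 = mean(X2,1) - xbar;
Sb = (size(X1,1)*(d1'*d1) + size(X2,1)*(d2'*d2)) / N;      % average between-class scatter
disp(norm(S - (Sw + Sb), 'fro'))                           % ~0 up to round-off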
ML-20
PCA Examples: Data Analysis

• Example 1: principal components of some data points (figure)
ML-21
PCA Examples: Feature Transformation

• Example 2: feature transformation and selection (figure: the correlation matrix for the old feature dimensions is transformed into new feature dimensions, with a threshold on the information content preserved)
ML-22
PCA Examples: Image Coding (1/2)

• Example 3: image coding (figure: a 256 x 256 image divided into 8 x 8 blocks)
ML-23
PCA Examples: Image Coding (2/2)

• Example 3: image coding (cont.): feature reduction and value reduction
ML-24
PCA Examples: Eigenface (1/4)

• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps:
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel:

        x_1 = [x_{11} x_{21} ... x_{n1}]^T,   x_2 = [x_{12} x_{22} ... x_{n2}]^T,   ...,   x_L = [x_{1L} x_{2L} ... x_{nL}]^T

    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA): find the eigenvectors of the matrix, i.e., the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0"; the other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)

• Example 4: Eigenface in face recognition (cont.)
  – Steps:
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K \le L) eigenfaces:

        \hat{x}_i = \bar{x} + w_{i1} e_1 + w_{i2} e_2 + ... + w_{iK} e_K
        \Rightarrow  y_i = [w_{i1} w_{i2} ... w_{iK}]^T    (feature vector of a person i)

  – The Eigenface approach preserves the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
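A compact MATLAB sketch of these steps; the synthetic data and sizes are illustrative only (a real run would load L vectorized face images as the columns of X).

n = 32*32;  L = 20;  K = 7;
X  = rand(n, L);                    % stand-in for L reference images, one per column
x0 = mean(X, 2);                    % "eigenface 0": the average face
Xc = X - x0 * ones(1, L);           % variations from the average face
C  = (Xc * Xc') / L;                % n x n covariance matrix
[E, ~] = eigs(C, K);                % K leading eigenvectors = the eigenfaces
W  = E' * Xc;                       % K x L: eigenface weights for each reference face
xhat1 = x0 + E * W(:, 1);           % face 1 rebuilt from eigenface 0 + K eigenfaces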
ML-26
PCA Examples: Eigenface (3/4)

(figures: the averaged face, and the face images used as the training set)
ML-27
PCA Examples: Eigenface (4/4)

(figures: seven eigenfaces derived from the training set, indicating directions of variation between faces, and a projected face image)
ML-28
PCA Examples: Eigenvoice (1/3)

• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps:
    • Concatenate the regarded parameters for each training speaker r to form a huge vector a(r) (a "supervector") of dimension D = (M x n) x 1, e.g., the speaker-dependent (SD) HMM mean parameters (\mu)
    • Apply Principal Component Analysis to construct the eigenvoice space (figure: data from speaker 1 ... speaker R, model training of the speaker-dependent HMMs plus a speaker-independent (SI) HMM, followed by PCA)
    • Each new speaker S is represented by a point P in K-space:

        P = e(0) + w(1) e(1) + w(2) e(2) + ... + w(K) e(K)

      where e(0) corresponds to the SI HMM model
ML-29
PCA Examples: Eigenvoice (2/3)

• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)

• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1): correlates with pitch or sex
  – Dimension 2 (eigenvoice 2): correlates with amplitude
  – Dimension 3 (eigenvoice 3): correlates with second-formant movement

  Note that Eigenface operates in the feature space, while Eigenvoice operates in the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)

• Also called:
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)

• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
  (Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space)
ML-33
LDA Derivations (1/4)

• Suppose there are N sample vectors x_i with dimensionality n, each belonging to one of the J classes: g(x_i) = j, j \in {1, 2, ..., J}, where g(.) is the class index
  – The sample mean is:             \bar{x} = (1/N) \sum_{i=1}^{N} x_i
  – The class sample means are:     \bar{x}_j = (1/N_j) \sum_{i: g(x_i)=j} x_i
  – The class sample covariances:   \Sigma_j = (1/N_j) \sum_{i: g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T
  – The average within-class variation before the transform:   S_w = (1/N) \sum_j N_j \Sigma_j
  – The average between-class variation before the transform:  S_b = (1/N) \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T
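A MATLAB sketch of these scatter matrices; the toy data and variable names are illustrative (X holds samples as rows, g the class labels 1..J).

X = [randn(100, 3) + 1; randn(150, 3) - 1];      % toy samples, one per row
g = [ones(100, 1); 2 * ones(150, 1)];            % class labels
[N, n] = size(X);  J = max(g);
xbar = mean(X, 1);
Sw = zeros(n);  Sb = zeros(n);
for j = 1:J
    Xj = X(g == j, :);  Nj = size(Xj, 1);
    Sw = Sw + Nj * cov(Xj, 1);                   % N_j * Sigma_j
    dj = mean(Xj, 1) - xbar;
    Sb = Sb + Nj * (dj' * dj);
end
Sw = Sw / N;  Sb = Sb / N;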
ML-34
LDA Derivations (2/4)

• If the transform W = [w_1 w_2 ... w_m] is applied:
  – The sample vectors will be:      y_i = W^T x_i
  – The sample mean will be:         \bar{y} = (1/N) \sum_{i=1}^{N} W^T x_i = W^T ( (1/N) \sum_{i=1}^{N} x_i ) = W^T \bar{x}
  – The class sample means will be:  \bar{y}_j = (1/N_j) \sum_{i: g(x_i)=j} W^T x_i = W^T \bar{x}_j
  – The average within-class variation will be:

      \tilde{S}_w = (1/N) \sum_j N_j { (1/N_j) \sum_{i: g(x_i)=j} (W^T x_i - W^T \bar{x}_j)(W^T x_i - W^T \bar{x}_j)^T }
                  = W^T { (1/N) \sum_j N_j \Sigma_j } W
                  = W^T S_w W
ML-35
LDA Derivations (3/4)

• If the transform W = [w_1 w_2 ... w_m] is applied:
  – Similarly, the average between-class variation will be:  \tilde{S}_b = W^T S_b W
  – Try to find the optimal W such that the following objective function is maximized:

      J(W) = |\tilde{S}_b| / |\tilde{S}_w| = |W^T S_b W| / |W^T S_w W|

    • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in

        S_b w_i = \lambda_i S_w w_i

    • That is, the w_i's are the eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_b:

        S_w^{-1} S_b w_i = \lambda_i w_i
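A sketch of this closed-form solution, reusing Sw, Sb, and X from the previous sketch (with J = 2 classes at most J - 1 eigenvalues are nonzero, so m = 1 here).

m = 1;
[Wall, Lam] = eig(Sb, Sw);                 % generalized eigenproblem Sb*w = lambda*Sw*w
[~, idx] = sort(diag(Lam), 'descend');
W = Wall(:, idx(1:m));                     % LDA projection directions (columns)
Y = X * W;                                 % projected samples, one per row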
ML-36
LDA Derivations (4/4)

• Proof:

      \hat{W} = argmax_W \tilde{J}(W) = argmax_W |W^T S_b W| / |W^T S_w W|

  – Or, for each column vector w_i of W, we want to find

      \hat{w}_i = argmax_{w_i} (w_i^T S_b w_i) / (w_i^T S_w w_i)

  – Setting the derivative to zero, using (F/G)' = (F'G - FG') / G^2 and d(x^T C x)/dx = (C + C^T) x:

      \partial/\partial w_i [ (w_i^T S_b w_i) / (w_i^T S_w w_i) ] = 0
      \Rightarrow  2 S_b w_i (w_i^T S_w w_i) - (w_i^T S_b w_i) 2 S_w w_i = 0
      \Rightarrow  S_b w_i - \lambda_i S_w w_i = 0,   where \lambda_i = (w_i^T S_b w_i) / (w_i^T S_w w_i)
      \Rightarrow  S_b w_i = \lambda_i S_w w_i
      \Rightarrow  S_w^{-1} S_b w_i = \lambda_i w_i

  – The quadratic form attains its optimal solution at this generalized eigenvector condition
  (|.| denotes the determinant)
ML-37
LDA Examples: Feature Transformation (1/2)

• Example 1: experiments on speech signal processing
  (figures: the covariance matrix of the 18-Mel-filter-bank vectors and, after the cosine transform, the covariance matrix of the 18-cepstral vectors; both calculated using Year-99's 5,471 files)

      \Sigma = (1/N) \sum_i (x_i - \bar{x})(x_i - \bar{x})^T,      \Sigma' = (1/N) \sum_i (y_i - \bar{y})(y_i - \bar{y})^T
ML-38
LDA Examples: Feature Transformation (2/2)

• Example 1: experiments on speech signal processing (cont.)
  (figures: the covariance matrices of the 18-PCA-cepstral vectors, after the PCA transform, and of the 18-LDA-cepstral vectors, after the LDA transform; both calculated using Year-99's 5,471 files)

  Character Error Rate (%):

      Features   WG      TC
      MFCC       22.71   26.32
      LDA-1      20.17   23.12
      LDA-2      20.11   23.11
ML-39
PCA vs. LDA (1/2)

(figure: a comparison of the projection directions found by PCA and by LDA)
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for a 2-class case (figure)
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling assumes data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)

• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively; also find/plot the covariance matrices for the two transformed data sets, respectively; describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set; selectively plot portions of samples from MaleData and FemaleData, respectively; describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)

• Plot a covariance matrix
• Eigen decomposition

% Plot a covariance matrix
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default'); surf(CoVar);

% Toy between-class (BE) and within-class (WI) scatter matrices
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];

% LDA:
IWI = inv(WI);  A = IWI * BE;
% PCA:
A = BE + WI;                      % why? (Prove it.)

[V, D] = eig(A);                  % or: [V, D] = eigs(A, 3)

% Write the basis vectors to a file
fid = fopen('Basis', 'w');
for i = 1:3                       % feature vector length
    for j = 1:3                   % basis number
        fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)

• Examples
  (figures: scatter plots of 2,000 original data samples after the PCA transform and after the LDA transform, plotted as feature 1 vs. feature 2)
ML-45
Latent Semantic Analysis (LSA) (1/7)

• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
LSA (2/7)

• Dimension Reduction and Feature Extraction
  – PCA: project onto an orthonormal basis \phi_1, ..., \phi_k of the feature space

      y_i = \phi_i^T X,    \hat{X} = \sum_{i=1}^{k} y_i \phi_i,    min || X - \hat{X} ||^2 for a given k

  – SVD (in LSA): approximate the m x n matrix A in the latent semantic space

      A (m x n) \approx A' = U' \Sigma' V'^T,   with U': m x k, \Sigma': k x k, V'^T: k x n,   k \le r \le min(m, n)
      min || A - A' ||_F^2 for a given k
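A minimal MATLAB sketch of this rank-k approximation (toy random data; a real A would be a sparse word-document count matrix).

A = rand(50, 30);  k = 5;
[U, S, V] = svd(A, 'econ');
Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);
Ak = Uk * Sk * Vk';                          % best rank-k approximation of A
err = norm(A - Ak, 'fro')^2;                 % equals the sum of the discarded sigma_i^2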
ML-47
LSA (3/7)

  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
  (figure: projection of a vector x onto an orthonormal basis; y_1 = \phi_1^T x = ||x|| cos\theta, where ||\phi_1|| = 1)
ML-48
LSA (4/7)

• Frameworks to circumvent vocabulary mismatch
  (figure: a query and a doc, each represented by terms, can be matched by literal term matching, by query expansion, by doc expansion, or through a latent semantic structure model for retrieval)
ML-49
LSA (5/7)
ML-50
LSA (6/7)

Query: "human computer interaction"
(an OOV word is marked in the figure)
ML-51
LSA (7/7)

• Singular Value Decomposition (SVD)

  Full SVD of the m x n word-document matrix A (rows w_1 ... w_m: words; columns d_1 ... d_n: docs):

      A (m x n) = U (m x r) \Sigma_r (r x r) V^T (r x n),    r \le min(m, n)
      Both U and V have orthonormal column vectors: U^T U = I (r x r),  V^T V = I (r x r)
      || A ||_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2
      (each row of A is a vector in R^n; each column of A is a vector in R^m)

  Truncated SVD:

      A' (m x n) = U' (m x k) \Sigma_k (k x k) V'^T (k x n),    k \le r,    || A ||_F^2 \ge || A' ||_F^2

  Docs and queries are represented in a k-dimensional space; the axes can be properly weighted according to the associated diagonal values of \Sigma_k
ML-52
LSA Derivations (1/7)

• Singular Value Decomposition (SVD)
  – A^T A is a symmetric n x n matrix
    • All eigenvalues \lambda_j are nonnegative real numbers:

        \lambda_1 \ge \lambda_2 \ge ... \ge \lambda_n \ge 0,      \Sigma^2 = diag(\lambda_1, \lambda_2, ..., \lambda_n)

    • All eigenvectors v_j are orthonormal (v_j \in R^n):

        V = [v_1 v_2 ... v_n] (n x n),   v_j^T v_j = 1,   V^T V = I (n x n)

  – Define the singular values \sigma_j = \sqrt{\lambda_j}, j = 1, ..., n
    • As the square roots of the eigenvalues of A^T A
    • As the lengths of the vectors A v_1, A v_2, ..., A v_n:

        \sigma_1 = ||A v_1||,  \sigma_2 = ||A v_2||,  ...
        since ||A v_i||^2 = (A v_i)^T (A v_i) = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i  \Rightarrow  ||A v_i|| = \sigma_i

  – For \lambda_i \ne 0, i = 1, ..., r: {A v_1, A v_2, ..., A v_r} is an orthogonal basis of Col A
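A quick numerical check (toy matrix) that the two definitions of the singular values agree.

A = rand(6, 4);
[V, D] = eig(A' * A);
[ev, idx] = sort(diag(D), 'descend');  V = V(:, idx);
lenAv = sqrt(sum((A * V).^2, 1))';           % lengths ||A v_i||
disp([svd(A), sqrt(ev), lenAv])              % all three columns should match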
ML-53
LSA Derivations (2/7)

• {A v_1, A v_2, ..., A v_r} is an orthogonal basis of Col A:

      (A v_i) . (A v_j) = (A v_i)^T (A v_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0   (i \ne j)

  – Suppose that A (or A^T A) has rank r \le n:

      \lambda_1 \ge \lambda_2 \ge ... \ge \lambda_r > 0,    \lambda_{r+1} = \lambda_{r+2} = ... = \lambda_n = 0

  – Define an orthonormal basis {u_1, u_2, ..., u_r} for Col A:

      u_i = (1 / ||A v_i||) A v_i = (1 / \sigma_i) A v_i   \Rightarrow   \sigma_i u_i = A v_i
      \Rightarrow  [u_1 u_2 ... u_r] \Sigma = A [v_1 v_2 ... v_r]
      (V = [v_1 ... v_r]: an orthonormal n x r matrix, known in advance from the eigenvectors of A^T A;
       U = [u_1 ... u_r]: also an orthonormal matrix, m x r)

    • Extend to an orthonormal basis {u_1, u_2, ..., u_m} of R^m:

      [u_1 u_2 ... u_m] \Sigma = A [v_1 v_2 ... v_n],
      where \Sigma = ( \Sigma_{r x r}   0_{r x (n-r)} ;  0_{(m-r) x r}   0_{(m-r) x (n-r)} )
      \Rightarrow  U \Sigma = A V   \Rightarrow   U \Sigma V^T = A V V^T   \Rightarrow   A = U \Sigma V^T

      || A ||_F^2 = \sigma_1^2 + \sigma_2^2 + ... + \sigma_r^2,   where || A ||_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2
ML-54
LSA Derivations (3/7)

• Writing U = [U_1 U_2] and V = [V_1 V_2], where U_1 (m x r) and V_1 (n x r) correspond to the r nonzero singular values:

      A (m x n) = U \Sigma V^T = (U_1 U_2) ( \Sigma_1  0 ; 0  0 ) ( V_1^T ; V_2^T ) = U_1 \Sigma_1 V_1^T
      A V_1 = U_1 \Sigma_1,      A V_2 = 0

  (figure: the rows v_i^T of V_1^T span the row space of A in R^n; the columns of U_1 span the column space of A in R^m; U \Sigma = A V)
ML-55
LSA Derivations (4/7)

• Additional Explanations
  – Each row of U\Sigma is related to the projection of the corresponding row of A onto the basis formed by the columns of V:

      A = U \Sigma V^T   \Rightarrow   A V = U \Sigma V^T V = U \Sigma

    • the i-th entry of a row of U\Sigma is related to the projection of the corresponding row of A onto the i-th column of V
  – Each row of V\Sigma is related to the projection of the corresponding row of A^T (i.e., column of A) onto the basis formed by the columns of U:

      A^T = V \Sigma^T U^T   \Rightarrow   A^T U = V \Sigma^T U^T U = V \Sigma^T = V \Sigma

    • the i-th entry of a row of V\Sigma is related to the projection of the corresponding row of A^T onto the i-th column of U
ML-56
LSA Derivations (5/7)

• Fundamental comparisons based on SVD
  – In the original word-document matrix A (m x n; rows w_1 ... w_m, columns d_1 ... d_n):
    • compare two terms → dot product of two rows of A (or an entry in A A^T)
    • compare two docs → dot product of two columns of A (or an entry in A^T A)
    • compare a term and a doc → each individual entry of A
  – In the new word-document matrix A':
    • compare two terms → dot product of two rows of U'\Sigma'
    • compare two docs → dot product of two rows of V'\Sigma'
    • compare a query word and a doc → each individual entry of A'
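A sketch of these comparisons in the reduced space, reusing Uk, Sk, and Vk from the rank-k SVD sketch above (the indices are illustrative).

T = Uk * Sk;                            % term representations, one per row  (U' Sigma')
D = Vk * Sk;                            % doc representations,  one per row  (V' Sigma')
term_sim = T(1, :) * T(2, :)';          % compare term 1 and term 2
doc_sim  = D(1, :) * D(2, :)';          % compare doc 1 and doc 2
td_entry = Uk(1, :) * Sk * Vk(2, :)';   % A'(1,2): compare term 1 and doc 2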
bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the
mean-squared error criterion ( ) meigenε
( )
[ ]( )[ ]( )
[ ] [ ][ ]( )nmmnmnmnmn
nmmnnm
nmmnjmnjnjm
jnmjTn
mkkjkj
jnjm
n
mj
n
mkjkk
Tjjk
n
mjj
Tj
μμμJ
μJ
j
μμUUΦRΦ
μμΦφφR
φφΦμΦRφ
μ0φRφφ
φφRφφ
where
where
where 22
Define
1
11
11
1 1
1
1 11
+minusminusminusminus
+minus+
+minusminuslele+
++=
lele+
+= +=+=
==rArr
=rArr
==forallrArr
==summinus=partpart
forallrArr
sum sum minusminussum= δTake derivation
Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors
mnminusUnm λλ 1+ R nm φφ 1+
ϕϕϕϕ RR 2=
partpart T
constraintsTo be minimized ⎩⎨⎧
ne=
=kjkj
jk if 0 if 1
δ
Objective function
ML-16
PCA Derivations (1013)
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)
such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion
Encoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
2
1
x
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
Decoder
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nx
xx
ˆˆˆ
ˆ2
1
x
( ) ( )( )-xx-xxE Tx ˆˆ minimize
xΦy Tprime=
[ ]m
T
eeeΦΦ
where
21=primeprime
yΦx prime=ˆ
Φ prime
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
my
yy
2
1
y
ML-17
PCA Derivations (1113)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition
ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (1213)
bull Scree Graphndash The plot of variance as a function of the number of eigenvectors
keptbull Select such that
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)
Thresholdnm
m ge+++++
+++λλλλ
λλλLL
L
21
21m
sumge=
n
iim n 1
1 λλ
ML-19
PCA Derivations (1313)
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V = [v_1\, v_2 \cdots v_n]_{n \times n}$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$, and $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
  – Define the singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$
    • As the square roots of the eigenvalues of $A^T A$
    • As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
      $\|Av_i\|^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$
  – For $\lambda_i \ne 0$, $i = 1, \ldots, r$: $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
ML-53
LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$:
    $(Av_i) \cdot (Av_j) = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j\, v_i^T v_j = 0$  for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$:
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$,  $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$:
      $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i$
      $\Rightarrow [u_1\, u_2 \cdots u_r]\, \Sigma = A\, [v_1\, v_2 \cdots v_r]$
      ($V$: an orthonormal $n \times r$ matrix, known in advance; $U$: also an orthonormal $m \times r$ matrix)
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$:
      $\Rightarrow [u_1\, u_2 \cdots u_m]\, \tilde{\Sigma} = A\, [v_1\, v_2 \cdots v_n]$,  where
        $\tilde{\Sigma} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
      $\Rightarrow U \tilde{\Sigma} = A V \;\Rightarrow\; U \tilde{\Sigma} V^T = A V V^T = A$  (since the full $V$ is square and $V V^T = I_{n \times n}$)
      $\Rightarrow A = U \tilde{\Sigma} V^T$
  – $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
ML-54
LSA Derivations (3/7)
• Writing $U = [U_1\; U_2]$ and $V = [V_1\; V_2]$ (the parts associated with the nonzero and the zero singular values):
    $A_{m \times n} = U \tilde{\Sigma} V^T = [U_1\; U_2] \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} [V_1\; V_2]^T = U_1 \Sigma_1 V_1^T$
    $\Rightarrow U \Sigma = A V$,   $A V_1 = U_1 \Sigma_1$,   $A V_2 = 0$
  – The $v_i^T$ span the row space of $A$; the $u_i$ span the column space of $A$
  [Diagram: $x \in R^n \xrightarrow{V^T} \xrightarrow{\Sigma} \xrightarrow{U} Ax \in R^m$, i.e., $U\Sigma = AV$]
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$:
      $A = U \Sigma V^T \;\Rightarrow\; A V = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = A V$
    • The $i$-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the $i$-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ (i.e., a column of $A$) onto the basis formed by the columns of $U$:
      $A = U \Sigma V^T \;\Rightarrow\; A^T U = V \Sigma U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U$
    • The $i$-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the $i$-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$, $m \times n$, rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$):
    • compare two terms → dot product of two rows of $A$ (or an entry in $A A^T$)
    • compare two docs → dot product of two columns of $A$ (or an entry in $A^T A$)
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$):
    • compare two terms → dot product of two rows of $U'\Sigma'$
    • compare two docs → dot product of two rows of $V'\Sigma'$
    • compare a query word and a doc → each individual entry of $A'$
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select $m$ such that $\dfrac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_n} \ge \text{Threshold}$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue): $\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$
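A minimal MATLAB sketch of this selection rule (the 0.9 threshold and the random input data are illustrative assumptions):

lambda = sort(eig(cov(randn(500,10))),'descend');   % eigenvalues of some input covariance matrix
ratio  = cumsum(lambda)/sum(lambda);                % (lambda_1+...+lambda_m)/(lambda_1+...+lambda_n)
m      = find(ratio >= 0.9, 1);                     % smallest m reaching the threshold
m_alt  = sum(lambda > mean(lambda));                % or: keep eigenvalues above the average eigenvalue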
ML-19
PCA Derivations (13/13)
• PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal:
    $J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$,
  where $\tilde{S}_w = W^T S_w W$, $\tilde{S}_b = W^T S_b W$, and ($i$: sample index, $j$: class index)
    $S_w = \frac{1}{N}\sum_j N_j \Sigma_j$,
    $S_b = \frac{1}{N}\sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$,
    $S = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$
  Try to show that $S = S_w + S_b$
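A small numerical check of the identity $S = S_w + S_b$ in MATLAB (the two-class synthetic data, class sizes, and dimensionality below are arbitrary assumptions, not part of the slide):

X = [randn(60,3)+1; randn(40,3)-1];            % 100 samples, 3 features, two classes
g = [ones(60,1); 2*ones(40,1)];                % class labels
N = size(X,1);  xbar = mean(X,1);
S = zeros(3); Sw = zeros(3); Sb = zeros(3);
for i = 1:N
    S = S + (X(i,:)-xbar)'*(X(i,:)-xbar)/N;    % total covariance
end
for j = 1:2
    Xj = X(g==j,:);  Nj = size(Xj,1);  mj = mean(Xj,1);
    Sigj = zeros(3);
    for i = 1:Nj, Sigj = Sigj + (Xj(i,:)-mj)'*(Xj(i,:)-mj)/Nj; end
    Sw = Sw + (Nj/N)*Sigj;                     % average within-class variation
    Sb = Sb + (Nj/N)*(mj-xbar)'*(mj-xbar);     % average between-class variation
end
max(max(abs(S - (Sw+Sb))))                     % ~0 up to round-off error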
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
  [Figure: correlation matrix for the old feature dimensions, the new feature dimensions, and the threshold for the information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: image coding
  [Figure: a 256×256 image divided into 8×8 blocks]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: image coding (cont.): feature reduction and value reduction
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces," derived from a set of reference images
  – Steps:
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA): find the eigenvectors of the matrix, i.e., the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0"; the other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
      $\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix},\quad \mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix},\quad \ldots,\quad \mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: eigenface in face recognition (cont.)
  – Steps:
    • Then the faces are represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces:
        $\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i1}\mathbf{e}_1 + w_{i2}\mathbf{e}_2 + \cdots + w_{iK}\mathbf{e}_K$
        $\Rightarrow \mathbf{y}_i = [\,w_{i1}, w_{i2}, \ldots, w_{iK}\,]$  (the feature vector of a person $i$)
  – The eigenface approach follows the minimum mean-squared-error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
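A minimal MATLAB sketch of these steps on synthetic data (the image size, the number of reference images L, and K = 7 are assumptions for illustration only):

L = 20;  n = 32*32;
X  = rand(n,L);                        % each column: one reference image as a pixel vector
x0 = mean(X,2);                        % "eigenface 0": the averaged face
Xc = X - repmat(x0,1,L);               % remove the average face
C  = Xc*Xc'/L;                         % covariance matrix (n x n)
[E,D] = eig(C);
[~,order] = sort(diag(D),'descend');
K = 7;
Efaces = E(:,order(1:K));              % eigenfaces 1..K (columns)
w    = Efaces'*(X(:,1)-x0);            % feature vector (weights) of reference image 1
xhat = x0 + Efaces*w;                  % reconstruction: eigenface 0 + weighted eigenfaces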
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps:
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector): the SD HMM model mean parameters (μ), of dimension D = (M·n)×1
  [Diagram: Speaker 1 Data … Speaker R Data → Model Training → Speaker 1 HMM … Speaker R HMM (plus an SI HMM) → eigenvoice space construction via Principal Component Analysis]
  – Each new speaker S is represented by a point P in K-space:
      $P = \mathbf{e}(0) + w_{i1}\mathbf{e}(1) + w_{i2}\mathbf{e}(2) + \cdots + w_{iK}\mathbf{e}(K)$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1): correlates with pitch or sex
  – Dimension 2 (eigenvoice 2): correlates with amplitude
  – Dimension 3 (eigenvoice 3): correlates with second-formant movement
  Note that eigenface performs on the feature space, while eigenvoice performs on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called:
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation over the average within-class variation is maximal
  [Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are $N$ sample vectors $\mathbf{x}_i$ with dimensionality $n$, each of which belongs to one of the $J$ classes: $g(\mathbf{x}_i) = j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is: $\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$
  – The class sample means are: $\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i)=j} \mathbf{x}_i$
  – The class sample covariances are: $\Sigma_j = \frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$
  – The average within-class variation before transform: $S_w = \frac{1}{N}\sum_j N_j \Sigma_j$
  – The average between-class variation before transform: $S_b = \frac{1}{N}\sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$
ML-34
LDA Derivations (2/4)
• If the transform $W = [\mathbf{w}_1\, \mathbf{w}_2 \cdots \mathbf{w}_m]$ is applied:
  – The sample vectors will be: $\mathbf{y}_i = W^T \mathbf{x}_i$
  – The sample mean will be: $\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N} W^T \mathbf{x}_i = W^T \left(\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\right) = W^T \bar{\mathbf{x}}$
  – The class sample means will be: $\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i)=j} W^T \mathbf{x}_i = W^T \bar{\mathbf{x}}_j$
  – The average within-class variation will be:
      $\tilde{S}_w = \frac{1}{N}\sum_j N_j \cdot \frac{1}{N_j}\sum_{i:\, g(\mathbf{x}_i)=j} (W^T \mathbf{x}_i - W^T \bar{\mathbf{x}}_j)(W^T \mathbf{x}_i - W^T \bar{\mathbf{x}}_j)^T$
      $= W^T \left\{\frac{1}{N}\sum_j N_j \Sigma_j\right\} W = W^T S_w W$
  (Under $W$: $\mathbf{x} \to \mathbf{y}$, $S_w \to \tilde{S}_w$, $S_b \to \tilde{S}_b$)
ML-35
LDA Derivations (3/4)
• If the transform $W = [\mathbf{w}_1\, \mathbf{w}_2 \cdots \mathbf{w}_m]$ is applied:
  – Similarly, the average between-class variation will be: $\tilde{S}_b = W^T S_b W$
  – Try to find an optimal $W$ such that the following objective function is maximized:
      $J(W) = \dfrac{\tilde{S}_b}{\tilde{S}_w} = \dfrac{W^T S_b W}{W^T S_w W}$
    • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in $S_b \mathbf{w}_i = \lambda_i S_w \mathbf{w}_i$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$: $S_w^{-1} S_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$
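A minimal MATLAB sketch of this closed-form solution on synthetic two-class data (the class sizes, means, and the choice of two projection directions are illustrative assumptions, not part of the slide):

X1 = randn(60,3) + repmat([2 0 0],60,1);       % class 1 samples
X2 = randn(40,3) - repmat([2 0 0],40,1);       % class 2 samples
X  = [X1; X2];  N = size(X,1);  xbar = mean(X,1);
Sw = (60*cov(X1,1) + 40*cov(X2,1))/N;          % average within-class variation
Sb = (60*(mean(X1,1)-xbar)'*(mean(X1,1)-xbar) + ...
      40*(mean(X2,1)-xbar)'*(mean(X2,1)-xbar))/N;   % average between-class variation
[V,D] = eig(Sw\Sb);                            % solves Sb*w = lambda*Sw*w
[~,order] = sort(real(diag(D)),'descend');     % eigenvalues are real here; real() guards round-off
W = V(:,order(1:2));                           % eigenvectors with the largest eigenvalues
Y = X*W;                                       % samples projected onto the LDA directions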
ML-36
LDA Derivations (4/4)
• Proof:
    $\hat{W} = \arg\max_W J(W) = \arg\max_W \dfrac{\tilde{S}_b}{\tilde{S}_w} = \arg\max_W \dfrac{W^T S_b W}{W^T S_w W}$
  Or, for each column vector $\mathbf{w}_i$ of $W$, we want to find $\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} J(\mathbf{w}_i) = \arg\max_{\mathbf{w}_i} \dfrac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i}$
  The quadratic form has an optimal solution where the derivative vanishes:
    $\dfrac{\partial J}{\partial \mathbf{w}_i} = \dfrac{2 S_b \mathbf{w}_i\, (\mathbf{w}_i^T S_w \mathbf{w}_i) - (\mathbf{w}_i^T S_b \mathbf{w}_i)\, 2 S_w \mathbf{w}_i}{(\mathbf{w}_i^T S_w \mathbf{w}_i)^2} = 0$
    $\Rightarrow S_b \mathbf{w}_i - \dfrac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i}\, S_w \mathbf{w}_i = 0$,  with $\lambda_i = \dfrac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i}$
    $\Rightarrow S_b \mathbf{w}_i - \lambda_i S_w \mathbf{w}_i = 0$
    $\Rightarrow S_w^{-1} S_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$
  (Using the quotient rule $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d}{d\mathbf{x}}(\mathbf{x}^T C \mathbf{x}) = (C + C^T)\mathbf{x}$; for a multi-column $W$ the ratio of the matrices is interpreted via determinants)
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: experiments on speech signal processing
  – Covariance matrix of the 18 Mel-filter-bank vectors, $\Sigma = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$, calculated using Year-99's 5,471 files
  – Covariance matrix of the 18 cepstral vectors (after the cosine transform), $\Sigma' = \frac{1}{N}\sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$, calculated using Year-99's 5,471 files
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: experiments on speech signal processing (cont.)
  – Covariance matrix of the 18 PCA-cepstral vectors (after the PCA transform) and of the 18 LDA-cepstral vectors (after the LDA transform), calculated using Year-99's 5,471 files
  – Character Error Rate:
      Feature   TC      WG
      MFCC      26.32   22.71
      LDA-1     23.12   20.17
      LDA-2     23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: PCA and LDA projection directions compared on the same data]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA provides a much lower classification error than LDA, theoretically
• However, most statistical modeling assumes data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply the PCA and LDA transformations to the merged data set, respectively; also find/plot the covariance matrices for the transformed data, respectively; describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set; selectively plot portions of samples from MaleData and FemaleData, respectively; describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
( ) WSWWSWSSSW bT
wT
bwJ +=+==~~~
WSWS wT
w =~
WSWS bT
b =~
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
( )( )sum minusminus=i
TiiN
xxxxS 1
sample index
class index
bw SSS += thatshowTry to
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (12)
bull Example 3 Image Coding
256
256
8
8
ML-23
PCA Examples Image Coding (22)
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples: Eigenface (1/4)

• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating point numbers representing the light intensity at each pixel
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA): find the eigenvectors of the matrix, i.e., the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0"; the other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
$x_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix},\quad x_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix},\ \ldots,\quad x_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$
ML-25
PCA Examples: Eigenface (2/4)

• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces:
      $\hat{x}_i = \bar{x} + w_{i1}e_1 + w_{i2}e_2 + \cdots + w_{iK}e_K \Rightarrow y_i = [w_{i1}\ w_{i2}\ \cdots\ w_{iK}]^T$ (the feature vector of person i)
  – The eigenface approach satisfies the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
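A compact sketch of the whole eigenface procedure on synthetic data; random vectors stand in for the L vectorized reference images, and the sizes n, L, K are illustrative assumptions:

  n = 100;  L = 12;  K = 5;                % pixels per image, reference images, kept eigenfaces
  X = rand(n, L);                          % columns x_1 ... x_L stand in for vectorized images

  x_bar = mean(X, 2);                      % "eigenface 0": the average face
  Xc    = X - repmat(x_bar, 1, L);         % centered images

  C = (Xc * Xc') / L;                      % covariance matrix (n x n)
  [E, D] = eigs(C, K);                     % the K leading eigenfaces e_1 ... e_K

  x  = X(:, 1);                            % one face
  w  = E' * (x - x_bar);                   % its feature vector y = [w_1 ... w_K]'
  xr = x_bar + E * w;                      % reconstruction x_hat = x_bar + sum_k w_k e_k
  disp(norm(x - xr) / norm(x))             % relative reconstruction error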
ML-26
PCA Examples: Eigenface (3/4)

(Figures: face images as the training set; the averaged face.)
ML-27
PCA Examples: Eigenface (4/4)

(Figures: seven eigenfaces derived from the training set, indicating directions of variation between faces; a projected face image.)
ML-28
PCA Examples: Eigenvoice (1/3)

• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the relevant parameters for each speaker r to form a huge vector a(r) (a supervector) of the SD HMM mean parameters (μ), with dimension D = (M·n)×1
    • Eigenvoice space construction: train Speaker 1 … Speaker R HMMs from each speaker's data (starting from the SI HMM model), then apply Principal Component Analysis to the R supervectors
    • Each new speaker S is represented by a point P in K-space:
      $P = e_0 + w_1^{(i)}e_1 + w_2^{(i)}e_2 + \cdots + w_K^{(i)}e_K$
ML-29
PCA Examples: Eigenvoice (2/3)

• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)

• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1): correlates with pitch or sex
  – Dimension 2 (eigenvoice 2): correlates with amplitude
  – Dimension 3 (eigenvoice 3): correlates with second-formant movement

Note that eigenface operates on the feature space, while eigenvoice operates on the model space.
ML-31
Linear Discriminant Analysis (LDA) (1/2)

• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)

• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space.
ML-33
LDA Derivations (1/4)

• Suppose there are N sample vectors $x_i$ with dimensionality n, each belonging to one of the J classes, with class index $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$
  – The sample mean is
    $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
  – The class sample means are
    $\bar{x}_j = \frac{1}{N_j}\sum_{i:\,g(x_i)=j} x_i$
  – The class sample covariances are
    $\Sigma_j = \frac{1}{N_j}\sum_{i:\,g(x_i)=j}(x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$
  – The average within-class variation before the transform is
    $S_w = \frac{1}{N}\sum_j N_j\,\Sigma_j$
  – The average between-class variation before the transform is
    $S_b = \frac{1}{N}\sum_j N_j\,(\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
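These definitions map directly onto code; a minimal sketch that builds Sw and Sb from a synthetic labeled data set (the data matrix X, label vector g, and the sizes are assumptions made for the example):

  N = 300;  n = 4;  J = 3;                 % samples, feature dimension, classes
  X = randn(N, n);                         % one sample per row
  g = randi(J, N, 1);                      % class index g(x_i) of each sample

  x_bar = mean(X, 1)';                     % overall sample mean
  Sw = zeros(n);  Sb = zeros(n);
  for j = 1:J
      Xj   = X(g == j, :);
      Nj   = size(Xj, 1);
      xj   = mean(Xj, 1)';                 % class sample mean
      Xjc  = Xj - repmat(xj', Nj, 1);
      Sigj = (Xjc' * Xjc) / Nj;            % class sample covariance
      Sw   = Sw + (Nj / N) * Sigj;                           % average within-class variation
      Sb   = Sb + (Nj / N) * (xj - x_bar) * (xj - x_bar)';   % average between-class variation
  end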
ML-34
LDA Derivations (2/4)

• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – The sample vectors will be $y_i = W^Tx_i$
  – The sample mean will be
    $\bar{y} = \frac{1}{N}\sum_{i=1}^{N}W^Tx_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N}x_i\right) = W^T\bar{x}$
  – The class sample means will be
    $\bar{y}_j = \frac{1}{N_j}\sum_{i:\,g(x_i)=j}W^Tx_i = W^T\bar{x}_j$
  – The average within-class variation will be
    $\tilde{S}_w = \frac{1}{N}\sum_j N_j\left\{\frac{1}{N_j}\sum_{i:\,g(x_i)=j}\left(W^Tx_i - W^T\bar{x}_j\right)\left(W^Tx_i - W^T\bar{x}_j\right)^T\right\} = \frac{1}{N}\sum_j N_j\,W^T\Sigma_jW = W^TS_wW$

(Quantities before the transform are written in terms of x, Sw, Sb; after the transform they become y, S̃w, S̃b.)
ML-35
LDA Derivations (3/4)

• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – Similarly, the average between-class variation will be $\tilde{S}_b = W^TS_bW$
  – Try to find the optimal $W$ such that the following objective function is maximized:
    $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^TS_bW|}{|W^TS_wW|}$
    • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in $S_bw_i = \lambda_iS_ww_i$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1}S_b$: $S_w^{-1}S_b\,w_i = \lambda_iw_i$
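Continuing the Sw/Sb sketch above, the LDA directions can then be read off MATLAB's generalized eigensolver; the number of kept directions m is an arbitrary choice here:

  % Sw, Sb and X as in the sketch above; m: number of LDA directions to keep
  m = 2;
  [W, Lambda] = eig(Sb, Sw);               % generalized eigenproblem: Sb*w = lambda*Sw*w
  [~, order]  = sort(diag(Lambda), 'descend');
  W = W(:, order(1:m));                    % columns of W: the m leading LDA directions

  Y = X * W;                               % the transformed samples (one per row)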
ML-36
LDA Derivations (4/4)

• Proof
  $\hat{W} = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^TS_bW|}{|W^TS_wW|}$
  Or, for each column vector $w_i$ of $W$, we want to find the $w_i$ that maximizes
  $J(w_i) = \frac{w_i^TS_bw_i}{w_i^TS_ww_i}$
  The quadratic form has an optimal solution where the derivative vanishes:
  $\frac{\partial J}{\partial w_i} = \frac{2S_bw_i\left(w_i^TS_ww_i\right) - 2S_ww_i\left(w_i^TS_bw_i\right)}{\left(w_i^TS_ww_i\right)^2} = 0$
  $\Rightarrow S_bw_i - \left(\frac{w_i^TS_bw_i}{w_i^TS_ww_i}\right)S_ww_i = 0 \Rightarrow S_bw_i - \lambda_iS_ww_i = 0,\quad \lambda_i = \frac{w_i^TS_bw_i}{w_i^TS_ww_i}$
  $\Rightarrow S_bw_i = \lambda_iS_ww_i \Rightarrow S_w^{-1}S_b\,w_i = \lambda_iw_i$

  (Using the quotient rule $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d}{dx}\left(x^TCx\right) = (C + C^T)x$; here $|\cdot|$ denotes the determinant.)
ML-37
LDA Examples: Feature Transformation (1/2)

• Example 1: Experiments on Speech Signal Processing
  – Figures: the covariance matrix of the 18-Mel-filter-bank vectors, and the covariance matrix of the 18-cepstral vectors (after the cosine transform), each calculated using Year-99's 5471 files
  – Before the transform: $\Sigma = \frac{1}{N}\sum_i(x_i - \bar{x})(x_i - \bar{x})^T$; after the transform: $\Sigma' = \frac{1}{N}\sum_i(y_i - \bar{y})(y_i - \bar{y})^T$
ML-38
LDA Examples: Feature Transformation (2/2)

• Example 1: Experiments on Speech Signal Processing (cont.)
  – Figures: the covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform) and of the 18-LDA-cepstral vectors (after the LDA transform), each calculated using Year-99's 5471 files
  – Character Error Rate (%):

             TC       WG
    MFCC     26.32    22.71
    LDA-1    23.12    20.17
    LDA-2    23.11    20.11
ML-39
PCA vs LDA (1/2)

(Figure: comparison of the projections found by PCA and by LDA.)
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA provides a much lower classification error than LDA, theoretically
• However, most statistical modeling assumes data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)

• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the two transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of the samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)

• Plot Covariance Matrix
• Eigen Decomposition

  % Plot a covariance matrix as a surface
  CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
  colormap('default');  surf(CoVar);

  % Between-class (BE) and within-class (WI) scatter matrices
  BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
  WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];
  % LDA:
  IWI = inv(WI);  A = IWI * BE;
  % PCA:
  A = BE + WI;              % why? (Prove it!)

  [V, D] = eig(A);          % all eigenvectors/eigenvalues
  [V, D] = eigs(A, 3);      % or only the 3 largest

  fid = fopen('Basis', 'w');
  for i = 1:3               % feature vector length
    for j = 1:3             % basis number
      fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
  end
  fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)

• Examples

(Figures: scatter plots of 2000 original data samples after the PCA transform and after the LDA transform, plotted as feature 1 vs. feature 2.)
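For item 3 of the homework, a minimal sketch of how such scatter plots might be produced; MaleData and FemaleData are assumed to be matrices with one 39-dimensional sample per row, and B is assumed to hold the two leading PCA or LDA basis vectors (for example from the eig/eigs calls on the previous slide):

  % Assumed inputs: MaleData, FemaleData (N x 39 each) and a 39 x 2 basis B
  Data = [MaleData; FemaleData];
  Y    = Data * B;                         % project every sample onto the 2-D basis

  nm = size(MaleData, 1);
  plot(Y(1:nm, 1), Y(1:nm, 2), 'b.');  hold on
  plot(Y(nm+1:end, 1), Y(nm+1:end, 2), 'r.');  hold off
  xlabel('feature 1');  ylabel('feature 2');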
ML-45
Latent Semantic Analysis (LSA) (1/7)

• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
LSA (2/7)

• Dimension Reduction and Feature Extraction
  – PCA (in the feature space): $y_i = \varphi_i^TX$ and $\hat{X} = \sum_{i=1}^{k}y_i\varphi_i$, where $\varphi_1, \ldots, \varphi_k$ form an orthonormal basis chosen to achieve $\min\|X - \hat{X}\|^2$ for a given k
  – SVD (in LSA, the latent semantic space): $A_{m\times n} = U'_{m\times r}\,\Sigma'_{r\times r}\,V'^T_{r\times n}$ with $r \le \min(m,n)$; keeping only the k×k upper-left block of Σ' gives the approximation $A'$ that achieves $\min\|A - A'\|_F^2$ for a given k
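The least-squares property can be checked directly: the Frobenius error of the best rank-k approximation equals the energy of the discarded singular values. A small sketch with an arbitrary matrix:

  A = randn(8, 6);                         % any m x n matrix
  k = 2;

  [U, S, V] = svd(A);
  Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';   % best rank-k approximation A'

  sigma = diag(S);
  disp([norm(A - Ak, 'fro'), sqrt(sum(sigma(k+1:end).^2))])   % the two values agree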
ML-47
LSA (3/7)

  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction

  Projection of a vector x: $y_1 = \varphi_1^Tx = x^T\varphi_1 = \|x\|\cos\theta$, where $\|\varphi_1\| = 1$
ML-48
LSA (4/7)

• Frameworks to circumvent vocabulary mismatch

(Diagram: a Query and a Doc are each represented by terms; literal term matching compares the two term sets directly; query expansion and doc expansion go through structure models, which support latent semantic structure retrieval.)
ML-49
LSA (5/7)

ML-50

LSA (6/7)

Query: "human computer interaction"
(an OOV word)
ML-51
LSA (7/7)

• Singular Value Decomposition (SVD)

  $A_{m\times n} = U_{m\times r}\,\Sigma_{r\,(r\times r)}\,V^T_{r\times n}$
  (rows of A indexed by words w1, w2, …, wm; columns by documents d1, d2, …, dn)

  $A'_{m\times n} = U'_{m\times k}\,\Sigma_{k\,(k\times k)}\,V'^T_{k\times n}$
  (the reduced decomposition that keeps only the k largest singular values)
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 3 Image Coding (cont)(value reduction)(feature reduction)
ML-24
PCA Examples Eigenface (14)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images
ndash Stepsbull Convert each of the L reference images into a vector of
floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these
reference vectors bull Apply Principal Component Analysis (PCA) find the
eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are
called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
⎥⎥⎥⎥⎥⎥
⎦
⎤
⎢⎢⎢⎢⎢⎢
⎣
⎡
=
nL
L
L
L
nn x
xx
x
xx
x
xx
2
1
2
22
12
2
1
21
11
1
xxx
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs
  into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine
    similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA:  y_i = φ_i^T X ,   X̂ = ∑_{i=1}^{k} y_i φ_i ,   k ≤ n
      {φ_1, …, φ_k} is an orthonormal basis of the feature space;
      objective:  min ||X − X̂||^2  for a given k
  – SVD (in LSA):  A_{m×n} = U_{m×r} Σ_r V^T        (Σ_r: r×r,  V^T: r×n,  r ≤ min(m, n))
      truncated to  A'_{m×n} = U'_{m×k} Σ_k V'^T     (Σ_k: k×k,  V'^T: k×n — a latent semantic space
      of k dimensions);
      objective:  min ||A − A'||_F^2  for a given k
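A quick MATLAB check of the "min ||A − A'||_F^2 for a given k" statement on a random stand-in matrix; for the SVD truncation, the residual equals the energy in the discarded singular values:

% Random stand-in matrix; in LSA, A would be the m x n word-document matrix.
A = rand(12, 8);
k = 3;

[U, S, V] = svd(A);                           % A = U*S*V'
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';    % rank-k truncation A'

err   = norm(A - Ak, 'fro');                  % ||A - A'||_F
sigma = diag(S);
check = sqrt(sum(sigma(k+1:end).^2));         % energy in the discarded singular values
% err and check agree up to round-off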
ML-47
LSA (3/7)
– Singular Value Decomposition (SVD) is used for the word-document matrix
  • A least-squares method for dimension reduction

  Projection of a vector x onto an orthonormal basis {φ_1, φ_2}:

    y_1 = φ_1^T x = x^T φ_1 = ||x|| cos θ_1 ,   where ||φ_1|| = 1
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch

[Diagram: a Query and a Doc are each represented by their terms. Literal term matching compares the
two sets of terms directly; query expansion and doc expansion enrich them; mapping both into a
latent semantic structure model supports latent-semantic-structure retrieval.]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)

  A_{m×n}  = U_{m×r} Σ_r V^T         (Σ_r: r×r,  V^T: r×n)
    rows of A: words w_1 … w_m;   columns of A: docs d_1 … d_n;   r ≤ min(m, n)

  A'_{m×n} = U'_{m×k} Σ_k V'^T       (Σ_k: k×k,  V'^T: k×n,  k ≤ r)
    the rank-k truncation;   ||A||_F^2 ≥ ||A'||_F^2

  – Both U and V have orthonormal column vectors:  U^T U = V^T V = I_{r×r}
  – ||A||_F^2 = ∑_{i=1}^{m} ∑_{j=1}^{n} a_{ij}^2
  – Row A ∈ R^n ,  Col A ∈ R^m
  – Docs and queries are represented in a k-dimensional space; the quantities on the axes can be
    properly weighted according to the associated diagonal values of Σ_k
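A toy MATLAB sketch of this structure (the counts below are made up for illustration; a real application would use term occurrence counts, possibly weighted, from a document collection):

% Toy word-document count matrix (rows: terms w1..w4, columns: docs d1..d4).
A = [2 0 1 0;
     1 1 0 0;
     0 2 0 1;
     0 0 1 1];

[U, S, V] = svd(A, 'econ');           % A = U*S*V'
r = rank(A);                          % r <= min(m, n)
disp(U'*U);  disp(V'*V);              % both (numerically) identity: orthonormal columns

k  = 2;                               % latent semantic dimensions to keep
Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';    % A' = U' * Sigma_k * V'^T
% norm(A, 'fro')^2 >= norm(Ak, 'fro')^2, as stated above.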
ML-52
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – A^T A is a symmetric n×n matrix
    • All eigenvalues λ_j are nonnegative real numbers:  λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ 0
    • All eigenvectors v_j are orthonormal (∈ R^n):
        V_{n×n} = [v_1 v_2 … v_n] ,   v_j^T v_j = 1 ,   V^T V = I_{n×n}
  – Define the singular values as
    • the square roots of the eigenvalues of A^T A:  σ_j = √λ_j ,  j = 1, …, n
      (so Σ^2 = diag(λ_1, …, λ_n))
    • the lengths of the vectors Av_1, Av_2, …, Av_n:  σ_1 = ||Av_1|| ,  σ_2 = ||Av_2|| , …

        ||Av_i||^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = λ_i v_i^T v_i = λ_i   ⇒   ||Av_i|| = σ_i

  – For λ_i ≠ 0, i = 1, …, r:  {Av_1, Av_2, …, Av_r} is an orthogonal basis of Col A
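A short MATLAB check of these definitions on a random stand-in matrix, comparing the square roots of the eigenvalues of A^T A with the singular values returned by svd and with the lengths ||A v_j||:

% Random stand-in matrix; any real m x n matrix works here.
A = randn(5, 3);

[V, D]     = eig(A' * A);                    % eigenpairs of A^T A (symmetric, nonnegative)
[lam, idx] = sort(diag(D), 'descend');       % lambda_1 >= ... >= lambda_n >= 0
V          = V(:, idx);

sigma_from_eig = sqrt(lam);                  % sigma_j = sqrt(lambda_j)
sigma_from_svd = svd(A);                     % same values, up to round-off
len_Av         = sqrt(sum((A * V).^2, 1))';  % ||A v_j||, also equal to sigma_j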
ML-53
LSA Derivations (2/7)
• {Av_1, Av_2, …, Av_r} is an orthogonal basis of Col A:

    (Av_i)·(Av_j) = (Av_i)^T (Av_j) = v_i^T A^T A v_j = λ_j v_i^T v_j = 0   for i ≠ j

  – Suppose that A (or A^T A) has rank r ≤ n:
      λ_1 ≥ λ_2 ≥ … ≥ λ_r > 0 ,   λ_{r+1} = λ_{r+2} = … = λ_n = 0
  – Define an orthonormal basis {u_1, u_2, …, u_r} for Col A:

      u_i = (1/||Av_i||) Av_i = (1/σ_i) Av_i   ⇒   Av_i = σ_i u_i
      ⇒ A [v_1 v_2 … v_r] = [u_1 u_2 … u_r] Σ
        (V_1 = [v_1 … v_r] is an orthonormal n×r matrix, U_1 = [u_1 … u_r] is an orthonormal m×r
         matrix, and Σ = diag(σ_1, …, σ_r) is known in advance)

    • Extend {u_1, …, u_r} to an orthonormal basis {u_1, u_2, …, u_m} of R^m:

      A [v_1 … v_r … v_n] = [u_1 … u_r … u_m] Σ̂ ,
        Σ̂ = ( Σ_{r×r}   0_{r×(n−r)} ;  0_{(m−r)×r}   0_{(m−r)×(n−r)} )
      ⇒ A V = U Σ̂   ⇒   A V V^T = U Σ̂ V^T   ⇒   A = U Σ̂ V^T        (V V^T = I_{n×n})

  ||A||_F^2 = ∑_{i=1}^{m} ∑_{j=1}^{n} a_{ij}^2 = σ_1^2 + σ_2^2 + … + σ_r^2
ML-54
LSA Derivations (3/7)

  A = U Σ̂ V^T = [U_1  U_2] ( Σ_1  0 ;  0  0 ) [V_1  V_2]^T = U_1 Σ_1 V_1^T

    where  A V_1 = U_1 Σ_1   and   A V_2 = 0

  – the columns of V_1 (the v_i with σ_i > 0) span the row space of A (a subspace of R^n)
  – the columns of U_1 (the u_i) span the column space of A (a subspace of R^m)
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of UΣ is related to the projection of a corresponding row of A onto the basis formed
    by the columns of V
    • the i-th entry of a row of UΣ is related to the projection of a corresponding row of A onto
      the i-th column of V

        A = U Σ V^T   ⇒   A V = U Σ V^T V = U Σ   ⇒   A V = U Σ

  – Each row of VΣ is related to the projection of a corresponding row of A^T (i.e., a column of A)
    onto the basis formed by the columns of U
    • the i-th entry of a row of VΣ is related to the projection of a corresponding row of A^T onto
      the i-th column of U

        A^T = V Σ U^T   ⇒   A^T U = V Σ U^T U = V Σ   ⇒   A^T U = V Σ
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix A (m×n; rows w_1 … w_m, columns d_1 … d_n)
    • compare two terms → dot product of two rows of A (or an entry in A A^T)
    • compare two docs → dot product of two columns of A (or an entry in A^T A)
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix A'
    • compare two terms → dot product of two rows of U'Σ'
    • compare two docs → dot product of two rows of V'Σ'
    • compare a query word and a doc → each individual entry of A'
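A MATLAB sketch of these comparisons on a toy word-document matrix (the counts are illustrative only):

% Toy word-document counts (rows: terms w1..w4, columns: docs d1..d4).
A = [2 0 1 0;
     1 1 0 0;
     0 2 0 1;
     0 0 1 1];

k = 2;
[U, S, V] = svd(A, 'econ');
Uk = U(:, 1:k);  Sk = S(1:k, 1:k);  Vk = V(:, 1:k);

T = Uk * Sk;                 % one row per term in the k-dimensional latent space
D = Vk * Sk;                 % one row per doc  in the k-dimensional latent space

term_term = T * T';          % compare two terms: dot products of rows of U'*Sigma'
doc_doc   = D * D';          % compare two docs:  dot products of rows of V'*Sigma'
Ak        = Uk * Sk * Vk';   % compare a term and a doc: an individual entry of A'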
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (4/7)
• Additional explanations
  – Each row of $U\Sigma$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
    • The i-th entry of a row of $U\Sigma$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$:
      $A = U\Sigma V^T \Rightarrow AV = U\Sigma V^TV = U\Sigma$
  – Each row of $V\Sigma$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
    • The i-th entry of a row of $V\Sigma$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$:
      $A^T = V\Sigma U^T \Rightarrow A^TU = V\Sigma U^TU = V\Sigma$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix $A$ ($m\times n$; rows $w_1, \dots, w_m$, columns $d_1, \dots, d_n$):
    • compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
    • compare two docs → dot product of two columns of $A$ (or an entry in $A^TA$)
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix $A'$:
    • compare two terms → dot product of two rows of $U'\Sigma'$
    • compare two docs → dot product of two rows of $V'\Sigma'$
    • compare a query word and a doc → each individual entry of $A'$
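These comparisons can be read off directly from the truncated factors; a brief sketch using the same invented toy matrix as before (values are illustrative only):

% Term-term, doc-doc and term-doc comparisons in the reduced space
A = [2 0 1 0; 1 1 0 0; 0 2 1 1; 0 0 3 1; 1 0 0 2];
k = 2;
[Uk, Sk, Vk] = svds(A, k);
termVecs = Uk * Sk;                       % one row per term
docVecs  = Vk * Sk;                       % one row per doc
disp(termVecs(1, :) * termVecs(2, :)')    % compare terms w1 and w2
disp(docVecs(1, :)  * docVecs(3, :)')     % compare docs d1 and d3
Ak = Uk * Sk * Vk';                       % new word-document matrix A'
disp(Ak(1, 3))                            % compare term w1 with doc d3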
Linear Discriminant Analysis (LDA)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
  – Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
LDA Derivations (1/4)
• Suppose there are $N$ sample vectors $x_i$ with dimensionality $n$, each of which belongs to one of the $J$ classes, $g(x_i) = j \in \{1, 2, \dots, J\}$ ($g(\cdot)$ is the class index)
  – The sample mean is $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
  – The class sample means are $\bar{x}_j = \frac{1}{N_j}\sum_{i:\,g(x_i)=j} x_i$
  – The class sample covariances are $\Sigma_j = \frac{1}{N_j}\sum_{i:\,g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$
  – The average within-class variation before transform is $S_w = \frac{1}{N}\sum_j N_j\,\Sigma_j$
  – The average between-class variation before transform is $S_b = \frac{1}{N}\sum_j N_j\,(\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
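A short sketch of these definitions on a tiny made-up labeled sample (the data matrix X and labels g below are invented):

% Within-class (Sw) and between-class (Sb) scatter from labeled samples
X = [1.0 2.0; 1.2 1.8; 0.8 2.2; 4.0 4.5; 4.2 4.4; 3.9 4.7];   % N-by-n samples (rows)
g = [1 1 1 2 2 2]';                                           % class labels
[N, n] = size(X);
xbar = mean(X, 1)';
Sw = zeros(n); Sb = zeros(n);
for c = unique(g)'
    Xj = X(g == c, :);  Nj = size(Xj, 1);
    xbarj = mean(Xj, 1)';
    Xc = Xj - repmat(xbarj', Nj, 1);
    Sw = Sw + (Nj / N) * (Xc' * Xc) / Nj;                     % (Nj/N) * Sigma_j
    Sb = Sb + (Nj / N) * (xbarj - xbar) * (xbarj - xbar)';
end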
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied:
  – The sample vectors will be $y_i = W^T x_i$
  – The sample mean will be $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} W^T x_i = W^T\big(\frac{1}{N}\sum_{i=1}^{N} x_i\big) = W^T\bar{x}$
  – The class sample means will be $\bar{y}_j = \frac{1}{N_j}\sum_{i:\,g(x_i)=j} W^T x_i = W^T\bar{x}_j$
  – The average within-class variation will be
    $\tilde{S}_w = \frac{1}{N}\sum_j N_j\Big\{\frac{1}{N_j}\sum_{i:\,g(x_i)=j}\big(W^T x_i - W^T\bar{x}_j\big)\big(W^T x_i - W^T\bar{x}_j\big)^T\Big\} = \frac{1}{N}\sum_j N_j\, W^T\Sigma_j W = W^T S_w W$
  (Notation: $x \to y$, $S_w \to \tilde{S}_w$, $S_b \to \tilde{S}_b$ after the transform.)
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied:
  – Similarly, the average between-class variation will be $\tilde{S}_b = W^T S_b W$
  – Try to find an optimal $W$ such that the following objective function is maximized:
    $J(W) = \dfrac{|\tilde{S}_b|}{|\tilde{S}_w|} = \dfrac{|W^T S_b W|}{|W^T S_w W|}$
    • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in $S_b w_i = \lambda_i S_w w_i$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b w_i = \lambda_i w_i$
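A sketch of this closed-form solution in MATLAB, with invented scatter matrices Sw and Sb standing in for those computed from real data:

% Columns of W: generalized eigenvectors of Sb*w = lambda*Sw*w
Sw = [4.0 1.2; 1.2 3.0];                 % invented within-class scatter
Sb = [2.5 0.8; 0.8 1.0];                 % invented between-class scatter
[Wall, Lambda] = eig(Sb, Sw);            % generalized eigenproblem
[lambdaSorted, idx] = sort(diag(Lambda), 'descend');
W = Wall(:, idx(1));                     % keep the leading eigenvector(s)
% Equivalently, these are eigenvectors of inv(Sw)*Sb:
[W2, L2] = eig(Sw \ Sb);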
ML-36
LDA Derivations (4/4)
• Proof:
  $\hat{W} = \arg\max_W \dfrac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \dfrac{|W^T S_b W|}{|W^T S_w W|}$
  Or, for each column vector $w_i$ of $W$, we want to find the $w_i$ that maximizes $J(w_i) = \dfrac{w_i^T S_b w_i}{w_i^T S_w w_i}$
  The quadratic form has an optimal solution where the derivative vanishes:
  $\dfrac{\partial J}{\partial w_i} = \dfrac{2\,S_b w_i\,(w_i^T S_w w_i) - 2\,S_w w_i\,(w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$
  $\Rightarrow S_b w_i - \Big(\dfrac{w_i^T S_b w_i}{w_i^T S_w w_i}\Big) S_w w_i = 0$, and with $\lambda_i = \dfrac{w_i^T S_b w_i}{w_i^T S_w w_i}$:
  $\Rightarrow S_b w_i = \lambda_i S_w w_i \Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$
  (Here we use the quotient rule $\big(\frac{F}{G}\big)' = \frac{F'G - FG'}{G^2}$, the derivative $\frac{d}{dx}(x^T C x) = (C + C^T)x$, and $|\cdot|$ denotes the determinant.)
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
  – Covariance matrix of the 18-Mel-filter-bank vectors vs. covariance matrix of the 18-cepstral vectors (after the cosine transform), both calculated using Year-99's 5471 files:
    $\Sigma = \frac{1}{N}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$,  $\Sigma' = \frac{1}{N}\sum_i (y_i - \bar{y})(y_i - \bar{y})^T$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
  – Covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform) vs. covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), both calculated using Year-99's 5471 files
  – Character Error Rate (%):
              TC      WG
    MFCC      26.32   22.71
    LDA-1     23.12   20.17
    LDA-2     23.11   20.11
ML-39
PCA vs LDA (1/2)
(Figure: comparison of the projections found by PCA and by LDA.)
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for a 2-class case:
  – Clearly, HDA provides a much lower classification error than LDA, theoretically
• However, most statistical modeling assumes data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
• Eigen Decomposition
% Plot a covariance matrix:
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default'); surf(CoVar);

% Eigen decomposition:
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class scatter
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class scatter
% LDA:  IWI = inv(WI); A = IWI * BE;
% PCA:  A = BE + WI;   % why? (Prove it!)
IWI = inv(WI); A = IWI * BE;
[V, D] = eig(A);        % or: [V, D] = eigs(A, 3);

fid = fopen('Basis', 'w');
for i = 1:3             % feature vector length
    for j = 1:3         % basis number
        fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);
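A possible continuation for step 3 of the homework (the stand-in data, label vector, and variable names below are hypothetical): project the merged data onto the first two basis vectors and scatter-plot the two populations.

% Project onto the first two PCA components and plot the two classes
Data = [randn(100, 3) + 1; randn(100, 3) - 1];   % stand-in for merged MaleData/FemaleData
labels = [ones(100, 1); 2 * ones(100, 1)];
[V, D] = eigs(cov(Data), 2);                     % first two principal directions
Proj = Data * V;                                 % N-by-2 projected coordinates
plot(Proj(labels == 1, 1), Proj(labels == 1, 2), 'bo'); hold on;
plot(Proj(labels == 2, 1), Proj(labels == 2, 2), 'r+');
xlabel('feature 1'); ylabel('feature 2');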
ML-44
HW-3: Feature Transformation (4/4)
• Examples
(Figure: scatter plots of feature 1 vs. feature 2 — the distribution of the 2000 original data samples after the PCA transform, and after the LDA transform.)
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A:
  $(Av_i)\cdot(Av_j) = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j\, v_i^T v_j = 0$ for $i \ne j$
– Suppose that A (or $A^T A$) has rank $r \le n$:
  $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, and $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
– Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A:
  $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; Av_i = \sigma_i u_i \;\Rightarrow\; A[v_1\, v_2 \cdots v_r] = [u_1\, u_2 \cdots u_r]\,\Sigma_1$
  ($[v_1 \cdots v_r]$ is an orthonormal $n\times r$ matrix, known in advance; $[u_1 \cdots u_r]$ is also an orthonormal $m\times r$ matrix)
• Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$:
  $A[v_1 \cdots v_r \cdots v_n] = [u_1 \cdots u_r \cdots u_m]\,\Sigma \;\Rightarrow\; AV = U\Sigma \;\Rightarrow\; A = U\Sigma V^T$
  where $\Sigma = \begin{pmatrix} \Sigma_1 & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$
  and $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$
ML-54
LSA Derivations (3/7)
$U\Sigma = \begin{pmatrix} U_1 & U_2 \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} U_1\Sigma_1 & 0 \end{pmatrix} = A\begin{pmatrix} V_1 & V_2 \end{pmatrix} = AV$
$\Rightarrow\; AV_1 = U_1\Sigma_1$ and $AV_2 = 0$
[Figure: the mapping from $R^n$ to $R^m$ defined by $A_{m\times n}$; the columns $v_i$ of $V_1$ span the row space of A, the columns $u_i$ of $U_1$ span Col A, and $U\Sigma = AV$]
ML-55
LSA Derivations (4/7)
• Additional Explanations
– Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
  • the i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V:
    $A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$
– Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
  • the i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U:
    $A = U\Sigma V^T \;\Rightarrow\; A^T U = V\Sigma^T U^T U = V\Sigma^T = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix A ($m\times n$; rows indexed by the words $w_1, \ldots, w_m$, columns by the docs $d_1, \ldots, d_n$)
  • compare two terms → dot product of two rows of A, or an entry in $AA^T$
  • compare two docs → dot product of two columns of A, or an entry in $A^T A$
  • compare a term and a doc → each individual entry of A
– The new word-document matrix (A')
  • compare two terms → dot product of two rows of $U'\Sigma'$
  • compare two docs → dot product of two rows of $V'\Sigma'$
  • compare a query word and a doc → each individual entry of A'
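A minimal sketch (reusing the small hypothetical matrix from the earlier sketch, with k = 2) showing why rows of $U'\Sigma'$ and $V'\Sigma'$ suffice for the term-term and doc-doc comparisons on A':

A = [1 0 2; 0 1 1; 3 0 0; 0 2 1];                    % hypothetical word-document matrix
[U, S, V] = svd(A); k = 2;
Uk = U(:, 1:k); Sk = S(1:k, 1:k); Vk = V(:, 1:k);
Aprime = Uk * Sk * Vk';
norm(Aprime * Aprime' - (Uk*Sk) * (Uk*Sk)', 'fro')   % ~0: term-term comparisons need only rows of U'*Sigma'
norm(Aprime' * Aprime - (Vk*Sk) * (Vk*Sk)', 'fro')   % ~0: doc-doc comparisons need only rows of V'*Sigma'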