discriminative feature extraction and dimension...

65
Discriminative Feature Extraction and Dimension Reduction - PCA, LDA and LSA References: 1. Introduction to Machine Learning , Chapter 6 2. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 3 Berlin Chen Graduate Institute of Computer Science & Information Engineering National Taiwan Normal University

Upload: others

Post on 16-Aug-2020

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

Discriminative Feature Extraction and Dimension Reduction

- PCA LDA and LSA

References1 Introduction to Machine Learning Chapter 6 2 Data Mining Concepts Models Methods and Algorithms Chapter 3

Berlin ChenGraduate Institute of Computer Science amp Information Engineering

National Taiwan Normal University

ML-2

Introduction (13)

bull Goal discover significant patterns or features from the input datandash Salient feature selection or dimensionality reduction

ndash Compute an input-output mapping based on some desirable properties

Networkx yInput space Feature space

W

ML-3

Introduction (23)

bull Principal Component Analysis (PCA)

bull Linear Discriminant Analysis (LDA)

bull Latent Semantic Analysis (LSA)

bull Heteroscedastic Discriminant Analysis (HDA)

ML-4

Introduction (33)

bull Formulation for discriminative feature extraction ndash Model-free (nonparametric)

bull Without prior information eg PCA bull With prior information eg LDA

ndash Model-dependent (parametric)bull Eg PLSA (Probabilistic Latent Semantic Analysis) with EM

(Expectation-Maximization) MCE (Minimum Classification Error) Training

ML-5

Principal Component Analysis (PCA) (12)

bull Known as Karhunen-Loẻve Transform (1947 1963)

ndash Or Hotelling Transform (1933)

bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing

bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples

accurately

bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo

Pearson 1901

ML-6

Principal Component Analysis (PCA) (22)

The patterns show a significant differencefrom each other in one of the transformed axes

ML-7

PCA Derivations (113)

bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean

before processing the following analysis

ndash x can be represented without error by the summation of n linearly independent vectors

sum ===

n

iiiiy Φyφx [ ]Tni yyy where 1=y

[ ]ni φφφΦ 1=

0xμ x == E

The basis vectorsThe i-th component

in the feature (mapped) space

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 2: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-2

Introduction (13)

bull Goal discover significant patterns or features from the input datandash Salient feature selection or dimensionality reduction

ndash Compute an input-output mapping based on some desirable properties

Networkx yInput space Feature space

W

ML-3

Introduction (23)

bull Principal Component Analysis (PCA)

bull Linear Discriminant Analysis (LDA)

bull Latent Semantic Analysis (LSA)

bull Heteroscedastic Discriminant Analysis (HDA)

ML-4

Introduction (33)

bull Formulation for discriminative feature extraction ndash Model-free (nonparametric)

bull Without prior information eg PCA bull With prior information eg LDA

ndash Model-dependent (parametric)bull Eg PLSA (Probabilistic Latent Semantic Analysis) with EM

(Expectation-Maximization) MCE (Minimum Classification Error) Training

ML-5

Principal Component Analysis (PCA) (12)

bull Known as Karhunen-Loẻve Transform (1947 1963)

ndash Or Hotelling Transform (1933)

bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing

bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples

accurately

bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo

Pearson 1901

ML-6

Principal Component Analysis (PCA) (22)

The patterns show a significant differencefrom each other in one of the transformed axes

ML-7

PCA Derivations (113)

bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean

before processing the following analysis

ndash x can be represented without error by the summation of n linearly independent vectors

sum ===

n

iiiiy Φyφx [ ]Tni yyy where 1=y

[ ]ni φφφΦ 1=

0xμ x == E

The basis vectorsThe i-th component

in the feature (mapped) space

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 3: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-3

Introduction (23)

bull Principal Component Analysis (PCA)

bull Linear Discriminant Analysis (LDA)

bull Latent Semantic Analysis (LSA)

bull Heteroscedastic Discriminant Analysis (HDA)

ML-4

Introduction (33)

bull Formulation for discriminative feature extraction ndash Model-free (nonparametric)

bull Without prior information eg PCA bull With prior information eg LDA

ndash Model-dependent (parametric)bull Eg PLSA (Probabilistic Latent Semantic Analysis) with EM

(Expectation-Maximization) MCE (Minimum Classification Error) Training

ML-5

Principal Component Analysis (PCA) (12)

bull Known as Karhunen-Loẻve Transform (1947 1963)

ndash Or Hotelling Transform (1933)

bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing

bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples

accurately

bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo

Pearson 1901

ML-6

Principal Component Analysis (PCA) (22)

The patterns show a significant differencefrom each other in one of the transformed axes

ML-7

PCA Derivations (113)

bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean

before processing the following analysis

ndash x can be represented without error by the summation of n linearly independent vectors

sum ===

n

iiiiy Φyφx [ ]Tni yyy where 1=y

[ ]ni φφφΦ 1=

0xμ x == E

The basis vectorsThe i-th component

in the feature (mapped) space

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 4: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-4

Introduction (33)

bull Formulation for discriminative feature extraction ndash Model-free (nonparametric)

bull Without prior information eg PCA bull With prior information eg LDA

ndash Model-dependent (parametric)bull Eg PLSA (Probabilistic Latent Semantic Analysis) with EM

(Expectation-Maximization) MCE (Minimum Classification Error) Training

ML-5

Principal Component Analysis (PCA) (12)

bull Known as Karhunen-Loẻve Transform (1947 1963)

ndash Or Hotelling Transform (1933)

bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing

bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples

accurately

bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo

Pearson 1901

ML-6

Principal Component Analysis (PCA) (22)

The patterns show a significant differencefrom each other in one of the transformed axes

ML-7

PCA Derivations (113)

bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean

before processing the following analysis

ndash x can be represented without error by the summation of n linearly independent vectors

sum ===

n

iiiiy Φyφx [ ]Tni yyy where 1=y

[ ]ni φφφΦ 1=

0xμ x == E

The basis vectorsThe i-th component

in the feature (mapped) space

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 5: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-5

Principal Component Analysis (PCA) (12)

bull Known as Karhunen-Loẻve Transform (1947 1963)

ndash Or Hotelling Transform (1933)

bull A standard technique commonly used for data reduction in statistical pattern recognition and signal processing

bull A transform by which the data set can be represented by reduced number of effective features and still retain the most intrinsic information contentndash A small set of features to be found to represent the data samples

accurately

bull Also called ldquoSubspace Decompositionrdquo ldquoFactor Analysisrdquo

Pearson 1901

ML-6

Principal Component Analysis (PCA) (22)

The patterns show a significant differencefrom each other in one of the transformed axes

ML-7

PCA Derivations (113)

bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean

before processing the following analysis

ndash x can be represented without error by the summation of n linearly independent vectors

sum ===

n

iiiiy Φyφx [ ]Tni yyy where 1=y

[ ]ni φφφΦ 1=

0xμ x == E

The basis vectorsThe i-th component

in the feature (mapped) space

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 6: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-6

Principal Component Analysis (PCA) (22)

The patterns show a significant differencefrom each other in one of the transformed axes

ML-7

PCA Derivations (113)

bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean

before processing the following analysis

ndash x can be represented without error by the summation of n linearly independent vectors

sum ===

n

iiiiy Φyφx [ ]Tni yyy where 1=y

[ ]ni φφφΦ 1=

0xμ x == E

The basis vectorsThe i-th component

in the feature (mapped) space

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 7: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-7

PCA Derivations (113)

bull Suppose x is an n-dimensional zero mean random vectorndash If x is not zero mean we can subtract the mean

before processing the following analysis

ndash x can be represented without error by the summation of n linearly independent vectors

sum ===

n

iiiiy Φyφx [ ]Tni yyy where 1=y

[ ]ni φφφΦ 1=

0xμ x == E

The basis vectorsThe i-th component

in the feature (mapped) space

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 8: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-8

PCA Derivations (213)

(23)

(01)

(10)

(11)(-11)

(52rsquo12rsquo)

⎥⎦

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡minus+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡+⎥

⎤⎢⎣

⎡=⎥

⎤⎢⎣

⎡11

121

11 1

21

11

25

10

301

232

orthogonal basis sets

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 9: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-9

PCA Derivations (313)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull Such that is equal to the projectionof on

⎩⎨⎧

ne=

=jiji

jTi if 0

if 1φφ

iφxxx T

iiT

ii y ϕϕ ==forall

iy

x1ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Φ

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 10: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-10

PCA Derivations (413)

ndash Further assume the column (basis) vectors of the matrix form an orthonormal set

bull also has the following propertiesndash Its mean is zero too

ndash Its variance is

bull The correlation between two projections and is

Φiy

0==== 0xx Ti

Ti

Tii EEyE ϕϕϕ jy

( )( ) j

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

Rφφφxxφ

φxxφxφxφ

==

==

[ ] [ ]xRR

xxxx

ofmatrix relation (auto-)cor theis

2222

iTi

iTT

iiTT

i EEyEyEyEii

i

ϕϕ

ϕϕϕϕσ

=

===minus=

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1 0

iy jy

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 11: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-11

PCA Derivations (513)

bull Minimum Mean-Squared Error Criterionndash We want to choose only m of that we still can

approximate well in mean-squared error criterionxsiφ

sumsumsum+===

+==n

mjjj

m

ii

n

ii yyy

111φφφx ii

( ) sum=

=m

iiym

1ˆ iφx

( ) ( )

sumsum

sum

sum sum

sumsum

+=+=

+=

+= +=

+=+=

==

=

⎭⎬⎫

⎩⎨⎧=

⎭⎬⎫

⎩⎨⎧

⎟⎠⎞⎜

⎝⎛⎟⎠⎞

⎜⎝⎛=minus=

n

mjj

Tj

n

mjj

n

mjj

kTj

n

mj

n

mkkj

kn

mkk

Tj

n

mjj

yE

yyE

yyEmEm

11

2

1

2

1 1

11

2

ˆ

R φφ

φφ

φφxx

σ

ε

⎩⎨⎧

ne=

=kjkj

kTj if 0

if 1φφQ

[ ] 2

222

0

j

j

yE

yEyE

yE

jj

j

=

minus=

=

σ

We should discard the

bases where the projections have lower variances

original vector

reconstructed vector

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 12: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-12

PCA Derivations (613)

bull Minimum Mean-Squared Error Criterionndash If the orthonormal (basis) set is selected to be the

eigenvectors of the correlation matrix associated with eigenvalues

bull They will have the property that

ndash Such that the mean-squared error mentioned above will be

siφR

siλ

jjj φR φ λ=

( )

sumsumsum

sum

+=+=+=

+=

===

=

n

mjj

n

mjjj

Tj

n

mjj

Tj

n

mjjm

111

1

2

λλ

σε

φφR φφ

is real and symmetric therefore its eigenvectors

form a orthonormal setR

is positive definite ( )=gt all eigenvalues are positive

R 0gtRxxT

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 13: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-13

PCA Derivations (713)

bull Minimum Mean-Squared Error Criterionndash If the eigenvectors are retained associated with the m largest

eigenvalues the mean-squared error will be

ndash Any two projections and will be mutually uncorrelated

bull Good news for most statistical modeling approachesndash Gaussians and diagonal matrices

( ) ( )0 where 11

gegegegegesum=+=

nmn

mjjeigen m λλλλε

iy jy

( )( ) 0 j ====

==

jTij

Tij

TTi

jTT

iTT

jTiji

E

EEyyE

φφRφφφxxφ

φxxφxφxφ

λ

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

sdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdotsdot

sdotsdot

nnnn

n

σσσ

σσσσσ

21

2222

11211

( )( )

sum==

minus⎟⎠⎞

⎜⎝⎛

sumasymp

minusminus=

=

i

Tii

T

TN

i

Tii

T

NE

N

E

xxxxR

μμxx

μxμxΣ

1

1

][

1

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 14: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-14

PCA Derivations (813)

bull An two-dimensional example of Principle Component Analysis

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 15: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-15

PCA Derivations (913)

bull Minimum Mean-Squared Error Criterionndash It can be proved that is the optimal solution under the

mean-squared error criterion ( ) meigenε

( )

[ ]( )[ ]( )

[ ] [ ][ ]( )nmmnmnmnmn

nmmnnm

nmmnjmnjnjm

jnmjTn

mkkjkj

jnjm

n

mj

n

mkjkk

Tjjk

n

mjj

Tj

μμμJ

μJ

j

μμUUΦRΦ

μμΦφφR

φφΦμΦRφ

μ0φRφφ

φφRφφ

where

where

where 22

Define

1

11

11

1 1

1

1 11

+minusminusminusminus

+minus+

+minusminuslele+

++=

lele+

+= +=+=

==rArr

=rArr

==forallrArr

==summinus=partpart

forallrArr

sum sum minusminussum= δTake derivation

Have a particular solution if is a diagonal matrix and its diagonal elements is the eigenvalues of and is their corresponding eigenvectors

mnminusUnm λλ 1+ R nm φφ 1+

ϕϕϕϕ RR 2=

partpart T

constraintsTo be minimized ⎩⎨⎧

ne=

=kjkj

jk if 0 if 1

δ

Objective function

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 16: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-16

PCA Derivations (1013)

bull Given an input vector x with dimension mndash Try to construct a linear transform Φrsquo (Φrsquo is an nxm matrix mltn)

such that the truncation result ΦrsquoTx is optimal in mean-squared error criterion

Encoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

2

1

x

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

Decoder

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nx

xx

ˆˆˆ

ˆ2

1

x

( ) ( )( )-xx-xxE Tx ˆˆ minimize

xΦy Tprime=

[ ]m

T

eeeΦΦ

where

21=primeprime

yΦx prime=ˆ

Φ prime

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

my

yy

2

1

y

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 17: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-17

PCA Derivations (1113)

bull Data compression in communication

ndash PCA is an optimal transform for signal representation and dimensional reduction but not necessary for classification tasks such as speech recognition

ndash PCA needs no prior information (eg class distributions of output information) of the sample patterns

(To be discussed later on)

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 18: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-18

PCA Derivations (1213)

bull Scree Graphndash The plot of variance as a function of the number of eigenvectors

keptbull Select such that

bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eivgenvalue)

Thresholdnm

m ge+++++

+++λλλλ

λλλLL

L

21

21m

sumge=

n

iim n 1

1 λλ

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 19: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-19

PCA Derivations (1313)

bull PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal

( ) WSWWSWSSSW bT

wT

bwJ +=+==~~~

WSWS wT

w =~

WSWS bT

b =~

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

( )( )sum minusminus=i

TiiN

xxxxS 1

sample index

class index

bw SSS += thatshowTry to

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 20: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-20

PCA Examples Data Analysis

bull Example 1 principal components of some data points

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 21: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-21

PCA Examples Feature Transformation

bull Example 2 feature transformation and selection

threshold for information content reserved

New feature dimensions

Correlation matrix for old feature dimensions

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 22: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-22

PCA Examples Image Coding (12)

bull Example 3 Image Coding

256

256

8

8

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 23: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-23

PCA Examples Image Coding (22)

bull Example 3 Image Coding (cont)(value reduction)(feature reduction)

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 24: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-24

PCA Examples Eigenface (14)

bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)

ndash Consider an individual image to be a linear combination of a small number of face components or ldquoeigenfacerdquo derived from a set of reference images

ndash Stepsbull Convert each of the L reference images into a vector of

floating point numbers representing light intensity in each pixelbull Calculate the coverancecorrelation matrix between these

reference vectors bull Apply Principal Component Analysis (PCA) find the

eigenvectors of the matrix the eigenfacesbull Besides the vector obtained by averaging all images are

called ldquoeigenface 0rdquo The other eigenfaces from ldquoeigenface 1rdquoonwards model the variations from this average face

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

⎥⎥⎥⎥⎥⎥

⎢⎢⎢⎢⎢⎢

=

nL

L

L

L

nn x

xx

x

xx

x

xx

2

1

2

22

12

2

1

21

11

1

xxx

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 25: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-25

PCA Examples Eigenface (24)

bull Example 4 Eigenface in face recognition (cont) ndash Steps

bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces

ndash The Eigenface approach persists the minimum mean-squared error criterion

ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces

( ) ( ) ( )[ ]Kiiii

Kiiii

wwwKwww

21

21

121ˆ

=rArr

++++=

yeeexx

Feature vector of a person i

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 26: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-26

PCA Examples Eigenface (34)

The averaged face

Face images as the training set

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 27: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-27

PCA Examples Eigenface (44)

A projected Face imageSeven eigenfaces derived from the training set

(Indicate directions of variations between faces )

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 28: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-28

PCA Examples Eigenvoice (13)

bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)

ndash Stepsbull Concatenating the regarded parameters for each speaker r to

form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)

Eigenvoice Eigenvoice space space

constructionconstruction

Speaker 1 Data

SI HMM

Speaker R Data

Model Training Model Training

Speaker 1 HMM Speaker R HMM

D = (Mn)times1 Principal Component

Analysis

Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace

( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=

SI HMM model

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 29: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-29

PCA Examples Eigenvoice (23)

bull Example 4 Eigenvoice in speaker adaptation (cont)

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 30: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-30

PCA Examples Eigenvoice (33)

bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)

bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)

bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)

bull Correlate with second-formantmovement

Note thatEigenface performs on feature spacewhile eigenvoice performs on model space

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 31: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-31

Linear Discriminant Analysis (LDA) (12)

bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear

Discriminant Analysisbull Fisher (1936) introduced it for two-class classification

bull Rao (1965) extended it to handle multiple-class classification

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 32: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-32

Linear Discriminant Analysis (LDA) (22)

bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal

Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 33: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-33

LDA Derivations (14)

bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is

ndash The class sample means are

ndash The class sample covariances are

ndash The average within-class variation before transform

ndash The average between-class variation before transform

ix

( ) ( ) index class is 21 sdotisin= gJjjg ix

sum==

N

iiN 1

1 xx

( )sum

==

jgi

jj

iN xxx 1

( )( )( )sum

=minusminus=

jg

Tjiji

jj

iN xxxxxΣ 1

sum=j

jjw NN

ΣS 1

( )( )sum minusminus=j

Tjjjb N

NxxxxS 1

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 34: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-34

LDA Derivations (24)

bull If the transform is appliedndash The sample vectors will be

ndash The sample mean will be

ndash The class sample means will be

ndash The average within-class variation will be

[ ]mwwwW 1 2=

iT

i xWy =

xWxWxWy TN

ii

TN

ii

T

NN=⎟

⎠⎞

⎜⎝⎛== sumsum

== 11

11

( ) jT

jgi

T

jj

iNxWxWy

x== sum

=

1

( )( )( )

( )( )

WSW

WΣW

xWxWxWxWSxx x

wT

jj

jT

T

jgi

T

ji

T

jg jgi

T

ji

T

jjjw

NN

NNNN

N ii i

=⎭⎬⎫

⎩⎨⎧=

⎪⎭

⎪⎬⎫

⎪⎩

⎪⎨⎧

⎟⎟⎠

⎞⎜⎜⎝

⎛minus⎟

⎟⎠

⎞⎜⎜⎝

⎛minussdot=

sum

sumsum sumsum== =

1

1111~

x y

wS wS~

bS bS~

W

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 35: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-35

LDA Derivations (34)

bull If the transform is appliedndash Similarly the average between-class variation will be

ndash Try to find optimal such that the following objective function is maximized

bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the

largest eigenvalues in

bull That is are the eigenvectors corresponding to thelargest eigenvalues of

[ ]mwwwW 1 2=

WSWS bT

b =~

W

( )WSW

WSW

S

SW

wT

bT

w

bJ == ~

~

iwiib wSwS λ=W

siw

iiibw wwSS λ=minus1bw SS 1minus

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 36: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-36

LDA Derivations (44)

bull Proof ( )

( ) ( )( )

( )( )

( )( )

iiib

iwiibiwiib

iwT

ibT

iiiw

Tiw

iwT

ib

iwT

ibT

iw

iwT

iwT

ib

iwT

ibT

iwiwT

ib

i

i

iwT

ibT

i

i

wT

bT

w

b

w

i

i

ii

i

i

i

i

i

ii

i

i

J

wwSSwSwSwSwS

wSwwSw

wSwwS

wSwwS

wSw

wSwwS

wSw

wSwwS

wSw

wSwwSwSwwSw

wSwwSw

Ww

WSW

WSW

S

SWW

WWW

λ

λλ

λλ

λ

λ

=rArr

=rArr=minusrArr

⎟⎟⎠

⎞⎜⎜⎝

⎛==minus

=minusrArr

=minus

=partpart

rArr

=

===

minus1

22

2

ˆˆˆ

0

0

0

022

solution optimal has form qradtic The

thatfind want towe of tor column veceach for Or

maxarg~

~maxargmaxargˆ

Q

Q

2GFGGF

GF primeminusprime

=prime⎟⎠⎞

⎜⎝⎛

( )xCCxCxx T

T

+=d

d )(

determinant

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 37: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-37

LDA Examples Feature Transformation (12)

bull Example1 Experiments on Speech Signal Processing

Covariance Matrix of the 18-Mel-filter-bank vectors

Calculated using Year-99rsquos 5471 files

Covariance Matrix of the 18-cepstral vectors

Calculated using Year-99rsquos 5471 files

After Cosine Transform

( )( )sum minusminus=i

TiiN x

xxxxΣ 1 ( )( )sum minusminus=primei

TiiN y

yyyyΣ 1

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 38: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-38

LDA Examples Feature Transformation (22)

Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors

Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files

20112311LDA-2

20172312LDA-1

22712632MFCC

WGTC

Character Error Rate

bull Example1 Experiments on Speech Signal Processing (cont)

After PCA Transform After LDA Transform

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 39: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-39

PCA vs LDA (12)

PCA LDA

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 40: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-40

Heteroscedastic Discriminant Analysis (HDA)

bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and

HDA for 2-class case

ndash Clearly the HDA provides a much lower classification error thanLDA theoretically

bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 41: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-41

HW-3 Feature Transformation (14)

bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for

the merged data set2 Apply PCA and LDA transformations to the merged data set

respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed

3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 42: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-42

HW-3 Feature Transformation (24)

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 43: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-43

HW-3 Feature Transformation (34)

bull Plot Covariance Matrix

bull Eigen Decomposition

CoVar=[30 05 0409 63 0204 04 42

]colormap(default)surf(CoVar)

BE=[30 35 1419 63 2224 04 42

]

WI=[40 41 2129 87 3544 32 43

]

LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )

[VD]=eig(A)[VD]=eigs(A3)

fid=fopen(Basisw)for i=13 feature vector length

for j=13 basis numberfprintf(fid1010f V(ij))

endfprintf(fidn)

end fclose(fid)

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 44: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-44

HW-3 Feature Transformation (44)

bull Examples

-10 -8 -6 -4 -2 0 2 4-2

-1

0

1

2

3

4

5

6

feature 1

feat

ure

2

2000筆原始資料經 PCA轉換後分布圖

-09 -08 -07 -06 -05 -04 -03 -02-12

-1

-08

-06

-04

-02

0

02

feature 1

feat

ure

2

2000筆原始資料經LDA轉換後分布圖

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 45: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-45

Latent Semantic Analysis (LSA) (17)

bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)

bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the

same dimensions

ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms

ndash Dimensions of the reduced space correspond to the axes of greatest variation

bull Closely related to Principal Component Analysis (PCA)

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 46: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-46

LSA (27)

bull Dimension Reduction and Feature Extractionndash PCA

ndash SVD (in LSA)

Xφ Tiiy =

Y

knX

kφ1φ

sum=

k

iiiy

1

φ

kφ1φ

nX

rxr

Σrsquo

r le min(mn)rxn

VrsquoTUrsquo

mxrmxn mxn

kxkA Arsquo

kgiven afor ˆmin2

XX minus

kF given afor min 2AA minusprime

feature space

latent semanticspace

latent semanticspace

k

k

orthonormal basis

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 47: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-47

LSA (37)

ndash Singular Value Decomposition (SVD) used for the word-document matrix

bull A least-squares method for dimension reduction

iφx

1ϕ2ϕ

1 where

cos

1

111 1

1

=

===

φ

xφxx

xx TT

y ϕϕ

θ

1y2y

Projection of a Vector x

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 48: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-48

LSA (47)

bull Frameworks to circumvent vocabulary mismatch

Doc

Query

terms

terms

doc expansion

query expansion

literal term matching

structure model

structure model

latent semanticstructure retrieval

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 49: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-49

LSA (57)

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 50: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-50

LSA (67)

Query ldquohuman computer interactionrdquo

An OOV word

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 51: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-51

LSA (77)

bull Singular Value Decomposition (SVD)

=

w1w2

wm

d1 d2 dn

mxn

rxr

mxr

rxnA Umxr

Σr VTrxm

d1 d2 dn

w1w2

wm

=

w1w2

wm

d1 d2 dn

mxn

kxk

mxk

kxnArsquo Ursquomxk

Σk VrsquoTkxm

d1 d2 dn

w1w2

wm

Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk

VTV=IrXr

Both U and V has orthonormalcolumn vectors

UTU=IrXr

K le r ||A||F2 ge ||Arsquo|| F2

r le min(mn)

sumsum= =

=m

i

n

jijFaA

1 1

22

Row A Rn

Col A Rmisinisin

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 52: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-52

LSA Derivations (17)

bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix

bull All eigenvalues λj are nonnegative real numbers

bull All eigenvectors vj are orthonormal ( Rn)

bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn

021 gegegege nλλλ

njjj 1 == λσ[ ]nnn vvvV 21=times

1=jT

jvv ( )nxn

T IVV =

( )ndiag λλλ 112 =Σ

22

11

Av

Av

=

=

σ

σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A

ii

iiiTii

TTii

Av

vvAvAvAv

σ

λλ

=rArr

===2

sigma

isin

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 53: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-53

LSA Derivations (27)

bull Av1 Av2 hellip Avr is an orthogonal basis of Col A

ndash Suppose that A (or ATA) has rank r le n

ndash Define an orthonormal basis u1 u2 hellip ur for Col A

bull Extend to an orthonormal basis u1 u2 hellip um of Rm

( ) 0====bull jT

ijjTT

ijT

iji vvAvAvAvAvAvAv λ

0 0 2121 ====gtgegege ++ nrrr λλλλλλ

[ ] [ ]rrr

iiiii

ii

i

vvvAuuu

AvuAvAvAv

u

11

2121 =ΣrArr

=rArr== σσ

[ ] [ ]

T

TTnrmr

VUA

AVVVUAVU

vvvvAuuuu

Σ=rArr

=ΣrArr=ΣrArr

=ΣrArr

2121

222

21

2 rFA σσσ +++=

sumsum= =

=m

i

n

jijFaA

1 1

22

V an orthonormal matrix (nxr)

nxnI

Known in advance

( )

( ) ( ) ( ) ⎟⎟⎠

⎞⎜⎜⎝

⎛=

minustimesminustimesminus

minustimestimes

rnrmrrm

rnrrnm 00

0Σ Σ

u also an orthonormal matrix

(mxr)

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 54: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-54

LSA Derivations (37)

mxn

A

2 V

1 V 1 U

2 U

( )

AVAV

VΣU

VV

000Σ

UUVUΣ

==

=

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

11

111

2

1121

T

T

T

TT0 =AX

Av

of space row thespans i

Ti

A

u

of space row

thespans

Rn Rm

U TV

AVU =Σ

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 55: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-55

LSA Derivations (47)

bull Additional Explanationsndash Each row of is related to the projection of a corresponding

row of onto the basis formed by columns of

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

ndash Each row of is related to the projection of a corresponding row of onto the basis formed by

bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of

UA V

AVUUVVUAV

VUAT

T

=ΣrArrΣ=Σ=rArr

Σ=

UA

V

V

UTA

( )UAV

VUUVUVUUA

VUA

T

TTTT

T

=ΣrArr

Σ=Σ=Σ=rArr

Σ=

TAV

U

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 56: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-56

LSA Derivations (57)

bull Fundamental comparisons based on SVDndash The original word-document matrix (A)

ndash The new word-document matrix (Arsquo)bull compare two terms

rarr dot product of two rows of UrsquoΣrsquobull compare two docs

rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo

w1w2

wm

d1 d2 dn

mxn

A

bull compare two terms rarr dot product of two rows of Andash or an entry in AAT

bull compare two docs rarr dot product of two columns of Andash or an entry in ATA

bull compare a term and a doc rarr each individual entry of A

ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T

ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T

For stretching or shrinking

Irxr

Ursquo=Umxk

Σrsquo=Σk

Vrsquo=Vnxk

wjwi

dk

ds

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 57: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-57

LSA Derivations (67)

bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the

original analysisbull Fold-in a new mx1 query (or doc) vector

ndash Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

row vectors

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 58: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-58

LSA Derivations (77)

bull Fold-in a new 1 X n term vector 1

11ˆminustimestimestimestimes Σ= kkknnk Vtt

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 59: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-59

LSA Example

bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels

Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 60: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-60

LSA Conclusions

bull Advantagesndash A clean formal framework and a clearly defined optimization

criterion (least-squares)bull Conceptual simplicity and clarity

ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)

ndash Good results for high-recall searchbull Take term co-occurrence into account

bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy

bull Eg bank basshellip

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 61: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-61

LSA Toolkit SVDLIBC (15)

bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library

bull Download it at httptedlabmitedu~dr

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 62: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-62

LSA Toolkit SVDLIBC (25)

bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs

ndash Each entry is weighted by TFxIDF score

bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space

bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality

23 00 42

00 13 22

38 00 05

00 00 00

Term

Doc4 3 6

2

0 23

2 38

11 1330 421 222 05

RowTem

Col Doc

Nonzero entries

2 nonzero entries at Col 0

Col 0 Row 0 Col 0 Row 2

1 nonzero entryat Col 1

Col 1 Row 1 3 nonzero entry

at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 63: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-63

LSA Toolkit SVDLIBC (35)

bull Example term-docmatrix

bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix

51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip

Indexing Term no Doc no Nonzero

entries

sparse matrix input prefix of output filesNo of reserved

eigenvectors name of sparse

matrix input

LSA100-Ut

LSA100-S

LSA100-Vt

output

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 64: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-64

LSA Toolkit SVDLIBC (45)

bull LSA100-Ut

bull LSA100-S

100 512530003 0001 helliphellip0002 0002 helliphellip

word vector (uT) 1x100

51253 words

10026861882994155959hellip

100 eigenvalues

bull LSA100-Vt

100 22650021 0035 helliphellip0012 0022 helliphellip

doc vector (vT) 1x100

2265 docs

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice

Page 65: Discriminative Feature Extraction and Dimension Reductionberlin.csie.ntnu.edu.tw/Courses/2006S-Machine... · Introduction to Machine Learning, Chapter 6 2. Data Mining: Concepts,

ML-65

LSA Toolkit SVDLIBC (55)

bull Fold-in a new mx1 query vector

bull Cosine measure between the query and doc vectors in the latent semantic space

( ) 111ˆ minus

timestimestimestimes Σ= kkkmmT

k UqqQuery represented by the weightedsum of it constituent term vectors

The separate dimensions are differentially weighted

Just like a row of V

( )ΣΣ

Σ=ΣΣ=

dq

dqdqcoinedqsimT

ˆˆ

ˆˆ)ˆˆ(ˆˆ

2

TFxIDF weighted beforehand

ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice