CHAPTER 6:
Dimensionality Reduction
Author: Christoph Eick. The material is mostly based on the Shlens PCA Tutorial (http://www2.cs.uh.edu/~ceick/ML/pca.pdf) and to a lesser extent on material in the Alpaydin book.
Why Reduce Dimensionality?
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of acquiring the feature
4. Simpler models are more robust
5. Easier to interpret; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Ch. Eick: Dimensionality Reduction
Feature Selection/Extraction/Construction
Feature selection: choose k<d important features, ignoring the remaining d–k.
  Subset selection algorithms
Feature extraction: project the original xi, i=1,...,d dimensions to new k<d dimensions zj, j=1,...,k.
  Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)
Feature construction: create new features based on old features, f=(…), with f usually being a non-linear function.
  Support vector machines, …
Key Ideas of Dimensionality Reduction
Given a dataset X, find a low-dimensional linear projection. Two possible formulations:
  The variance in low-d is maximized
  The average projection cost is minimized
Both are equivalent.
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected there, information loss is minimized. The projection of x on the direction of w is: z = wTx.
Find w such that Var(z) is maximized:
  Var(z) = Var(wTx) = E[(wTx – wTμ)2] = E[(wTx – wTμ)(wTx – wTμ)]
         = E[wT(x – μ)(x – μ)Tw] = wT E[(x – μ)(x – μ)T] w = wT Σ w
where Var(x) = E[(x – μ)(x – μ)T] = Σ
Question: Why does PCA maximize and not minimize the variance in z?
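The identity Var(wTx) = wT Σ w above can be checked numerically: for any unit-length direction w, the sample variance of the projected values equals wT Σ w, with Σ the sample covariance matrix. A minimal sketch with a made-up 3-d dataset (the direction w and all numbers are illustrative):

```python
import numpy as np

# Made-up 3-d dataset with 500 examples (rows = examples here,
# for np.cov convenience; the slides stack examples as columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])

w = np.array([1.0, 1.0, 0.0])
w = w / np.linalg.norm(w)          # unit-length projection direction

z = X @ w                          # z = w^T x for every example
Sigma = np.cov(X, rowvar=False)    # sample covariance matrix of x

# Var(z) computed directly agrees with w^T Sigma w
print(np.var(z, ddof=1), w @ Sigma @ w)
```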
Clarifications
Assume the dataset x is d-dimensional with n examples, and we want to reduce it to a k-dimensional dataset z.
Writing the examples as columns, x is a d×n matrix, wT is a k×d matrix whose rows are the k eigenvectors, and
z = wTx is k×n (you take scalar products of the elements in x with w, obtaining a k-dimensional dataset).
Remarks: w contains the k eigenvectors of the covariance matrix of x with the highest eigenvalues: Σwi = λiwi.
k is usually chosen based on the variance captured / largeness of the first k eigenvalues.
http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
Corrected on 2/24/2011
Example
z = wTx (you take scalar products of the elements in x with w, obtaining a k-dimensional dataset)
Example: a 4-d dataset x which contains 5 examples (as columns), projected with a 2×4 matrix wT:

  x =  1  2  0  1  4        wT =  0.5  0.5  -1.0  0.5
       0  0  2 -1  2              1.0  0.0   1.0  1.0
       1  1  1  1  1
       0 -1  0  3  0
(z1,z2) := (0.5*a1 + 0.5*a2 − a3 + 0.5*a4, a1 + a3 + a4) for an example (a1,a2,a3,a4)
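Assuming the matrices read off the slide (x with the 5 examples as columns, wT with the two projection directions as rows), the transformation can be carried out in a few lines of numpy:

```python
import numpy as np

# The 4-d dataset with 5 examples from the slide (columns are examples)
x = np.array([[1, 2, 0, 1, 4],
              [0, 0, 2, -1, 2],
              [1, 1, 1, 1, 1],
              [0, -1, 0, 3, 0]], dtype=float)

# The two projection directions, stacked as the rows of wT (a 2x4 matrix)
wT = np.array([[0.5, 0.5, -1.0, 0.5],
               [1.0, 0.0, 1.0, 1.0]])

z = wT @ x    # 2x5: each column is the 2-d image of one example
print(z)
```

Each column of z is the (z1,z2) pair for the corresponding example.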
Shlens Tutorial on PCA
PCA is one of the most valuable results of applied linear algebra (another being PageRank).
“The goal of PCA is to compute the most meaningful basis to re-express a noisy dataset. The hope is that the new basis will filter out noise and reveal hidden structure.”
The goal of PCA is deciphering “garbled” data, referring to: rotation, redundancy, and noise.
PCA is a non-parametric method; there is no way to incorporate preferences and other choices.
Computing Principal Components as Eigenvectors of the Covariance Matrix
1. Normalize x by subtracting from each attribute value its mean, obtaining y.
2. Compute 1/(n−1)*yyT = the covariance matrix of x.
3. Diagonalize it, obtaining a set of eigenvectors e with: Σei = λiei (λi is the eigenvalue of the ith eigenvector).
4. Select how many and which eigenvectors in e to keep, obtaining w (based on variance expressed / largeness of eigenvalues and possibly other criteria).
5. Create your transformed dataset z = wTx.
Remark: Symmetric matrices are always orthogonally diagonalizable; see the proof on page 11 of the Shlens paper!
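Steps 1–5 can be sketched in a few lines of numpy. The function name `pca` and the random test data are illustrative; `np.linalg.eigh` is used because the covariance matrix is symmetric:

```python
import numpy as np

def pca(x, k):
    """PCA following steps 1-5: x is d x n, with examples as columns."""
    # 1. Center: subtract each attribute's mean
    y = x - x.mean(axis=1, keepdims=True)
    # 2. Covariance matrix with n-1 in the denominator
    n = x.shape[1]
    cov = (y @ y.T) / (n - 1)
    # 3. Diagonalize; eigh returns eigenvalues in ascending order
    vals, vecs = np.linalg.eigh(cov)
    # 4. Keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(vals)[::-1][:k]
    wT = vecs[:, order].T           # k x d, eigenvectors as rows
    # 5. Transformed dataset z = w^T x
    return wT @ x, vals[order]

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 100))       # made-up 4-d dataset, 100 examples
z, top_vals = pca(x, 2)
print(z.shape, top_vals)
```

As a sanity check, the sample variance of the first row of z equals the largest eigenvalue, matching Var(z) = wT Σ w from the earlier slide.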
Textbook’s PCA Version
Maximize Var(z) subject to ||w|| = 1:
  max_w1 w1T Σ w1 − α(w1T w1 − 1)
⇒ Σw1 = αw1, that is, w1 is an eigenvector of Σ.
Choose the one principal component with the largest eigenvalue for Var(z).
Second principal component: max Var(z2), s.t. ||w2|| = 1 and w2 orthogonal to w1:
  max_w2 w2T Σ w2 − α(w2T w2 − 1) − β(w2T w1 − 0)
⇒ Σw2 = αw2, that is, w2 is another eigenvector of Σ, and so on.
What PCA does: z = WT(x – m)
where the columns of W are the eigenvectors of Σ, and m is the sample mean. This centers the data at the origin and rotates the axes.
http://www.youtube.com/watch?v=BfTMmoDFXyE
How to choose k?
Proportion of Variance (PoV) explained:
  PoV = (λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λd)
when the λi are sorted in descending order.
Typically, stop at PoV > 0.9. A scree graph plots PoV vs. k; stop at the “elbow”.
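The PoV rule is a one-liner once the eigenvalues are sorted. A short sketch in numpy, with made-up eigenvalues (already in descending order):

```python
import numpy as np

# Hypothetical sorted eigenvalues of a covariance matrix (descending)
lam = np.array([4.0, 2.5, 1.0, 0.3, 0.15, 0.05])

pov = np.cumsum(lam) / lam.sum()    # PoV for k = 1, 2, ..., d
k = int(np.argmax(pov > 0.9)) + 1   # smallest k with PoV > 0.9
print(pov, k)
```

Plotting `pov` against k gives the scree graph mentioned above; the “elbow” and the 0.9 threshold often suggest similar values of k.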
All Roads Lead to Rome!
[Diagram: SVD, maximizing variance, and the eigenvectors of S are all roads leading to “Rome” — the principal components.]
http://voices.yahoo.com/historical-origin-all-roads-lead-rome-5443183.html?cat=37
Visualizing Numbers after applying PCA
Multidimensional Scaling
Given the pairwise distances between N points,
  dij, i,j = 1,...,N
place the points on a low-dimensional map such that the distances are preserved.
With z = g(x | θ), find θ that minimizes the Sammon stress:
  E(θ | X) = Σr,s ( ||zr − zs|| − ||xr − xs|| )2 / ||xr − xs||2
           = Σr,s ( ||g(xr | θ) − g(xs | θ)|| − ||xr − xs|| )2 / ||xr − xs||2
L1-Norm: http://en.wikipedia.org/wiki/Taxicab_geometry
Lq-Norm: http://en.wikipedia.org/wiki/Lp_space
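The Sammon stress can be evaluated directly from two point configurations. A small sketch in numpy (the helper name `sammon_stress` and the toy points are illustrative; rows are points, and the Euclidean L2-norm is used for the distances):

```python
import numpy as np

def sammon_stress(X, Z):
    """Sum over point pairs of the squared, normalized mismatch between
    original distances (rows of X) and map distances (rows of Z)."""
    n = X.shape[0]
    E = 0.0
    for r in range(n):
        for s in range(r):               # each unordered pair once
            dx = np.linalg.norm(X[r] - X[s])   # distance in original space
            dz = np.linalg.norm(Z[r] - Z[s])   # distance on the low-d map
            E += (dz - dx) ** 2 / dx ** 2
    return E

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# A map that reproduces every pairwise distance has zero stress
print(sammon_stress(X, X.copy()))
```

An MDS algorithm would adjust the map coordinates Z (i.e., the parameters θ of g) to drive this stress down.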
Map of Europe by MDS
Map from CIA – The World Factbook: http://www.cia.gov/
http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html