CHAPTER 6:
Dimensionality Reduction
Author: Christoph Eick. The material is mostly based on the Shlens PCA Tutorial (http://www2.cs.uh.edu/~ceick/ML/pca.pdf) and to a lesser extent on material in the Alpaydin book.
Why Reduce Dimensionality?
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of acquiring the feature
4. Simpler models are more robust
5. Easier to interpret; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Ch. Eick: Dimensionality Reduction
Feature Selection/Extraction/Construction
Feature selection: choose k<d important features, ignoring the remaining d–k.
  Subset selection algorithms
Feature extraction: project the original xi, i=1,...,d dimensions to new k<d dimensions zj, j=1,...,k.
  Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)
Feature construction: create new features based on old features, f=(…), with f usually being a non-linear function.
  Support vector machines, …
Key Ideas of Dimensionality Reduction
Given a dataset X, find a low-dimensional linear projection. Two possible formulations:
  The variance in low-d is maximized
  The average projection cost is minimized
Both are equivalent.
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected there, information loss is minimized. The projection of x on the direction of w is: z = wTx.
Find w such that Var(z) is maximized:
  Var(z) = Var(wTx) = E[(wTx – wTμ)2] = E[(wTx – wTμ)(wTx – wTμ)]
         = E[wT(x – μ)(x – μ)Tw] = wT E[(x – μ)(x – μ)T] w = wT Σ w
where Var(x) = E[(x – μ)(x – μ)T] = Σ
Question: Why does PCA maximize and not minimize the variance in z?
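The identity Var(wTx) = wT Σ w above can be checked numerically: for any unit-length direction w, the sample variance of the projected values equals wT Σ w, with Σ the sample covariance matrix. A minimal sketch with a made-up 3-d dataset (the direction w and all numbers are illustrative):

```python
import numpy as np

# Made-up 3-d dataset with 500 examples (rows = examples here,
# for np.cov convenience; the slides stack examples as columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])

w = np.array([1.0, 1.0, 0.0])
w = w / np.linalg.norm(w)          # unit-length projection direction

z = X @ w                          # z = w^T x for every example
Sigma = np.cov(X, rowvar=False)    # sample covariance matrix of x

# Var(z) computed directly agrees with w^T Sigma w
print(np.var(z, ddof=1), w @ Sigma @ w)
```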
Clarifications
Assume the dataset x is d-dimensional with n examples, and we want to reduce it to a k-dimensional dataset z.
Writing the examples as columns, x is a d×n matrix, wT is a k×d matrix whose rows are the k eigenvectors, and
z = wTx is k×n (you take scalar products of the elements in x with w, obtaining a k-dimensional dataset).
Remarks: w contains the k eigenvectors of the covariance matrix of x with the highest eigenvalues: Σwi = λiwi.
k is usually chosen based on the variance captured / largeness of the first k eigenvalues.
http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
Corrected on 2/24/2011
Example
z = wTx (you take scalar products of the elements in x with w, obtaining a k-dimensional dataset)
Example: a 4-d dataset x which contains 5 examples (as columns), projected with a 2×4 matrix wT:

  x =  1  2  0  1  4        wT =  0.5  0.5  -1.0  0.5
       0  0  2 -1  2              1.0  0.0   1.0  1.0
       1  1  1  1  1
       0 -1  0  3  0
(z1,z2) := (0.5*a1 + 0.5*a2 − a3 + 0.5*a4, a1 + a3 + a4) for an example (a1,a2,a3,a4)
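Assuming the matrices read off the slide (x with the 5 examples as columns, wT with the two projection directions as rows), the transformation can be carried out in a few lines of numpy:

```python
import numpy as np

# The 4-d dataset with 5 examples from the slide (columns are examples)
x = np.array([[1, 2, 0, 1, 4],
              [0, 0, 2, -1, 2],
              [1, 1, 1, 1, 1],
              [0, -1, 0, 3, 0]], dtype=float)

# The two projection directions, stacked as the rows of wT (a 2x4 matrix)
wT = np.array([[0.5, 0.5, -1.0, 0.5],
               [1.0, 0.0, 1.0, 1.0]])

z = wT @ x    # 2x5: each column is the 2-d image of one example
print(z)
```

Each column of z is the (z1,z2) pair for the corresponding example.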
Shlens Tutorial on PCA
PCA is one of the most valuable results of applied linear algebra (another being PageRank).
“The goal of PCA is to compute the most meaningful basis to re-express a noisy dataset. The hope is that the new basis will filter out noise and reveal hidden structure.”
The goal of PCA is deciphering “garbled” data, referring to: rotation, redundancy, and noise.
PCA is a non-parametric method; there is no way to incorporate preferences and other choices.
Computing Principal Components as Eigenvectors of the Covariance Matrix
1. Normalize x by subtracting from each attribute value its mean, obtaining y.
2. Compute 1/(n−1)*yyT = the covariance matrix of x.
3. Diagonalize it, obtaining a set of eigenvectors e with: Σei = λiei (λi is the eigenvalue of the ith eigenvector).
4. Select how many and which eigenvectors in e to keep, obtaining w (based on variance expressed / largeness of eigenvalues and possibly other criteria).
5. Create your transformed dataset z = wTx.
Remark: Symmetric matrices are always orthogonally diagonalizable; see the proof on page 11 of the Shlens paper!
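Steps 1–5 can be sketched in a few lines of numpy. The function name `pca` and the random test data are illustrative; `np.linalg.eigh` is used because the covariance matrix is symmetric:

```python
import numpy as np

def pca(x, k):
    """PCA following steps 1-5: x is d x n, with examples as columns."""
    # 1. Center: subtract each attribute's mean
    y = x - x.mean(axis=1, keepdims=True)
    # 2. Covariance matrix with n-1 in the denominator
    n = x.shape[1]
    cov = (y @ y.T) / (n - 1)
    # 3. Diagonalize; eigh returns eigenvalues in ascending order
    vals, vecs = np.linalg.eigh(cov)
    # 4. Keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(vals)[::-1][:k]
    wT = vecs[:, order].T           # k x d, eigenvectors as rows
    # 5. Transformed dataset z = w^T x
    return wT @ x, vals[order]

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 100))       # made-up 4-d dataset, 100 examples
z, top_vals = pca(x, 2)
print(z.shape, top_vals)
```

As a sanity check, the sample variance of the first row of z equals the largest eigenvalue, matching Var(z) = wT Σ w from the earlier slide.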
Textbook’s PCA Version
Maximize Var(z) subject to ||w|| = 1:
  max_w1 w1T Σ w1 − α(w1T w1 − 1)
⇒ Σw1 = αw1, that is, w1 is an eigenvector of Σ.
Choose the one principal component with the largest eigenvalue for Var(z).
Second principal component: max Var(z2), s.t. ||w2|| = 1 and w2 orthogonal to w1:
  max_w2 w2T Σ w2 − α(w2T w2 − 1) − β(w2T w1 − 0)
⇒ Σw2 = αw2, that is, w2 is another eigenvector of Σ, and so on.
What PCA does: z = WT(x – m)
where the columns of W are the eigenvectors of Σ, and m is the sample mean. This centers the data at the origin and rotates the axes.
http://www.youtube.com/watch?v=BfTMmoDFXyE
How to choose k?
Proportion of Variance (PoV) explained:
  PoV = (λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λd)
when the λi are sorted in descending order.
Typically, stop at PoV > 0.9. A scree graph plots PoV vs. k; stop at the “elbow”.
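The PoV rule is a one-liner once the eigenvalues are sorted. A short sketch in numpy, with made-up eigenvalues (already in descending order):

```python
import numpy as np

# Hypothetical sorted eigenvalues of a covariance matrix (descending)
lam = np.array([4.0, 2.5, 1.0, 0.3, 0.15, 0.05])

pov = np.cumsum(lam) / lam.sum()    # PoV for k = 1, 2, ..., d
k = int(np.argmax(pov > 0.9)) + 1   # smallest k with PoV > 0.9
print(pov, k)
```

Plotting `pov` against k gives the scree graph mentioned above; the “elbow” and the 0.9 threshold often suggest similar values of k.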
All Roads Lead to Rome!
[Diagram: SVD, maximizing variance, and the eigenvectors of S are all roads leading to “Rome” — the principal components.]
http://voices.yahoo.com/historical-origin-all-roads-lead-rome-5443183.html?cat=37
Visualizing Numbers after applying PCA
Multidimensional Scaling
Given the pairwise distances between N points,
  dij, i,j = 1,...,N
place the points on a low-dimensional map such that the distances are preserved.
With z = g(x | θ), find θ that minimizes the Sammon stress:
  E(θ | X) = Σr,s ( ||zr − zs|| − ||xr − xs|| )2 / ||xr − xs||2
           = Σr,s ( ||g(xr | θ) − g(xs | θ)|| − ||xr − xs|| )2 / ||xr − xs||2
L1-Norm: http://en.wikipedia.org/wiki/Taxicab_geometry
Lq-Norm: http://en.wikipedia.org/wiki/Lp_space
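The Sammon stress can be evaluated directly from two point configurations. A small sketch in numpy (the helper name `sammon_stress` and the toy points are illustrative; rows are points, and the Euclidean L2-norm is used for the distances):

```python
import numpy as np

def sammon_stress(X, Z):
    """Sum over point pairs of the squared, normalized mismatch between
    original distances (rows of X) and map distances (rows of Z)."""
    n = X.shape[0]
    E = 0.0
    for r in range(n):
        for s in range(r):               # each unordered pair once
            dx = np.linalg.norm(X[r] - X[s])   # distance in original space
            dz = np.linalg.norm(Z[r] - Z[s])   # distance on the low-d map
            E += (dz - dx) ** 2 / dx ** 2
    return E

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# A map that reproduces every pairwise distance has zero stress
print(sammon_stress(X, X.copy()))
```

An MDS algorithm would adjust the map coordinates Z (i.e., the parameters θ of g) to drive this stress down.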
Map of Europe by MDS
Map from CIA – The World Factbook: http://www.cia.gov/
http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html