Machine Learning for Data Science (CS4786), Lecture 3
Principal Component Analysis
Course Webpage: http://www.cs.cornell.edu/Courses/cs4786/2016sp/
-
ANNOUNCEMENTS
Waitlist size currently about 55 :(
-
DIMENSIONALITY REDUCTION
Given feature vectors $x_1, \ldots, x_n \in \mathbb{R}^d$, compress the data points into a low-dimensional representation $y_1, \ldots, y_n \in \mathbb{R}^K$ where $K \ll d$.
-
PCA: VARIANCE MAXIMIZATION
Pick directions along which the data varies the most.

First principal component:

$$w_1 = \arg\max_{w : \|w\|_2 = 1} \frac{1}{n} \sum_{t=1}^{n} \Big( w^\top x_t - \frac{1}{n} \sum_{s=1}^{n} w^\top x_s \Big)^2 = \arg\max_{w : \|w\|_2 = 1} \frac{1}{n} \sum_{t=1}^{n} \big( w^\top (x_t - \mu) \big)^2 = \arg\max_{w : \|w\|_2 = 1} w^\top \Sigma w$$

$\Sigma$ is the covariance matrix.

Writing down the Lagrangian and optimizing: $\Sigma w_1 = \lambda w_1$, and hence $w_1^\top \Sigma w_1 = \lambda$.
[Figure: 2-D scatter of the data with the first principal direction overlaid.]

First principal direction = top eigenvector
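As a quick numerical sanity check (not from the lecture; a minimal NumPy sketch with made-up data), the empirical variance along the top eigenvector equals the largest eigenvalue, $w_1^\top \Sigma w_1 = \lambda$:

```python
import numpy as np

# Made-up anisotropic data: variance differs along each axis.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.0, 0.3])

mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)          # covariance with the lecture's 1/n scaling

evals, evecs = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
w1 = evecs[:, -1]                               # top eigenvector = first principal direction
var_along_w1 = np.mean(((X - mu) @ w1) ** 2)    # (1/n) sum_t (w^T (x_t - mu))^2
assert np.isclose(var_along_w1, evals[-1])      # equals w1^T Sigma w1 = lambda_max
```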
-
PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components.

Top K principal components are the eigenvectors with the K largest eigenvalues.

Projection = Data × top K eigenvectors
Reconstruction = Projection × transpose of top K eigenvectors

Independently discovered by Pearson in 1901 and Hotelling in 1933.

1. $\Sigma = \operatorname{cov}(X)$
2. $W = \operatorname{eigs}(\Sigma, K)$
3. $Y = (X - \mu)\, W$
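A minimal NumPy sketch of steps 1–3 (the function and variable names are my own, not course code); rows of X are data points:

```python
import numpy as np

def pca(X, K):
    """PCA via eigendecomposition of the covariance matrix.

    X: (n, d) array, one data point per row.
    Returns the projections Y (n, K), the basis W (d, K), and the mean mu.
    """
    mu = X.mean(axis=0)
    Xc = X - mu                              # center the data
    Sigma = Xc.T @ Xc / len(X)               # 1. covariance matrix (d, d)
    evals, evecs = np.linalg.eigh(Sigma)     # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :K]                # 2. top-K eigenvectors (largest eigenvalues)
    Y = Xc @ W                               # 3. project: Y = (X - mu) W
    return Y, W, mu
```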
-
A PICTURE
[Figure: data points projected onto a direction $w$; the projection coefficient is $y_t[1] = x_t^\top w = \|x_t\| \cos(\angle(x_t, w))$.]
-
ORTHONORMAL PROJECTIONS
Think of $w_1, \ldots, w_K$ as a coordinate system for PCA.

The $y$ values provide the coefficients in this system.

Without loss of generality, $w_1, \ldots, w_K$ can be orthonormal, i.e. $w_i \perp w_j$ and $\|w_i\| = 1$.

Reconstruction: $\hat{x}_t = y_t^\top W^\top + \mu$
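A one-function sketch of the orthonormality property (my helper, not course code; W is a basis such as the one returned by the pca() sketch above):

```python
import numpy as np

def is_orthonormal(W, tol=1e-8):
    """True if the columns of W are unit length and mutually orthogonal,
    i.e. W^T W equals the K x K identity."""
    return np.allclose(W.T @ W, np.eye(W.shape[1]), atol=tol)
```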
-
PCA: VARIANCE MAXIMIZATION
How do we find the remaining components? We are looking for orthogonal directions.

Start with the d-dimensional space.
While we haven't yet found K directions:
    Find the first principal component direction.
    Remove this direction and consider the data points in the remaining subspace, after projecting out this component.
End

This solution is given by W = top K eigenvectors of $\Sigma$ (a sketch of the loop follows below).
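One way to realize this loop (my sketch, not course code): find each top direction by power iteration, then deflate the covariance matrix before searching for the next one:

```python
import numpy as np

def top_k_by_deflation(Sigma, K, iters=500, seed=0):
    """Find K principal directions one at a time, deflating Sigma each round."""
    d = Sigma.shape[0]
    rng = np.random.default_rng(seed)
    W = np.zeros((d, K))
    for k in range(K):
        w = rng.standard_normal(d)
        for _ in range(iters):                    # power iteration -> top eigenvector
            w = Sigma @ w
            w /= np.linalg.norm(w)
        W[:, k] = w
        lam = w @ Sigma @ w                       # variance captured by this direction
        Sigma = Sigma - lam * np.outer(w, w)      # deflate: remove this direction
    return W
```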
-
PCA: VARIANCE MAXIMIZATION
Covariance matrix:

$$\Sigma = \frac{1}{n} \sum_{t=1}^{n} (x_t - \mu)(x_t - \mu)^\top$$

It's a $d \times d$ matrix; $\Sigma[i, j]$ measures the "covariance" of features $i$ and $j$.

Recall $\operatorname{cov}(A, B) = \mathbb{E}[(A - \mathbb{E}[A])(B - \mathbb{E}[B])]$. Alternatively,

$$\Sigma[i, j] = \frac{1}{n} \begin{bmatrix} x_1[i] - \mu[i] \\ \vdots \\ x_n[i] - \mu[i] \end{bmatrix}^\top \begin{bmatrix} x_1[j] - \mu[j] \\ \vdots \\ x_n[j] - \mu[j] \end{bmatrix}$$

Inner products measure similarity.
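The two views agree: the average outer product over points equals, entrywise, the inner products of the centered feature columns. A small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # n = 100 points, d = 5 features
Xc = X - X.mean(axis=0)

# Definition 1: average outer product of centered data points.
Sigma = sum(np.outer(x, x) for x in Xc) / len(X)
# Definition 2: Sigma[i, j] = inner product of centered feature columns i and j, / n.
Sigma_alt = Xc.T @ Xc / len(X)
assert np.allclose(Sigma, Sigma_alt)
```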
-
PCA: MINIMIZING RECONSTRUCTION ERROR
Goal: find the basis that minimizes the reconstruction error,

$$\sum_{t=1}^{n} \|\hat{x}_t - x_t\|_2^2 = \sum_{t=1}^{n} \Big\| \sum_{j=1}^{k} y_t[j]\, w_j + \mu - x_t \Big\|_2^2$$

$$= \sum_{t=1}^{n} \Big\| \sum_{j=1}^{k} y_t[j]\, w_j + \mu - \sum_{j=1}^{d} y_t[j]\, w_j - \mu \Big\|_2^2$$

(here we expand $x_t$ exactly in the full orthonormal basis: $x_t = \sum_{j=1}^{d} y_t[j]\, w_j + \mu$)

$$= \sum_{t=1}^{n} \Big\| \sum_{j=k+1}^{d} y_t[j]\, w_j \Big\|_2^2 \qquad \text{(note that } y_t[j] = w_j^\top (x_t - \mu) \text{)}$$

$$= \sum_{t=1}^{n} \Big\| \sum_{j=k+1}^{d} \big( w_j^\top (x_t - \mu) \big) w_j \Big\|_2^2$$

$$= \sum_{t=1}^{n} \sum_{j=k+1}^{d} \big( w_j^\top (x_t - \mu) \big)^2 = \sum_{t=1}^{n} \sum_{j=k+1}^{d} w_j^\top (x_t - \mu)(x_t - \mu)^\top w_j$$

(the last line uses orthonormality of the $w_j$'s: the cross terms vanish)
-
PCA: MINIMIZING RECONSTRUCTION ERROR
Goal: find the basis that minimizes the reconstruction error,

$$\frac{1}{n} \sum_{t=1}^{n} \|\hat{x}_t - x_t\|_2^2 = \frac{1}{n} \sum_{t=1}^{n} \sum_{j=k+1}^{d} w_j^\top (x_t - \mu)(x_t - \mu)^\top w_j = \sum_{j=k+1}^{d} w_j^\top \Sigma w_j$$

Minimize w.r.t. $w$'s that are orthonormal:

$$\arg\min_{\forall j,\ \|w_j\|_2 = 1} \sum_{j=k+1}^{d} w_j^\top \Sigma w_j$$

Using Lagrange multipliers, there exist $\lambda_{k+1}, \ldots, \lambda_d$ such that the solution to the above is given by:

$$\text{minimize} \quad \sum_{j=k+1}^{d} w_j^\top \Sigma w_j - \sum_{j=k+1}^{d} \lambda_j \big( \|w_j\|_2^2 - 1 \big)$$

Setting the derivative to 0: $\Sigma w_j = \lambda_j w_j$. That is, the $w_j$'s are eigenvectors and the $\lambda_j$'s are eigenvalues.
-
PCA: MINIMIZING RECONSTRUCTION ERROR
Solution: the $w_j$'s are eigenvectors and the $\lambda_j$'s are the corresponding eigenvalues. Further, the reconstruction error can be written as:

$$\min_{\forall j,\ \|w_j\|_2 = 1} \sum_{j=k+1}^{d} w_j^\top \Sigma w_j = \sum_{j=k+1}^{d} \lambda_j\, w_j^\top w_j = \sum_{j=k+1}^{d} \lambda_j$$

Clearly, to minimize the reconstruction error we need to minimize $\sum_{j=k+1}^{d} \lambda_j$. In other words, we discard the $d - k$ directions that have the smallest eigenvalues.
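This identity is easy to verify numerically (my sketch, with made-up data): the average reconstruction error using the top K directions equals the sum of the $d - K$ smallest eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 6))  # correlated data
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)

evals, evecs = np.linalg.eigh(Sigma)          # ascending eigenvalues
K = 2
W = evecs[:, -K:]                             # top-K eigenvectors
Xhat = (X - mu) @ W @ W.T + mu                # project, then reconstruct

avg_err = np.mean(np.sum((Xhat - X) ** 2, axis=1))
assert np.isclose(avg_err, evals[:-K].sum())  # sum of the d-K smallest eigenvalues
```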
-
PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components.

Top K principal components are the eigenvectors with the K largest eigenvalues.

Projection = Data × top K eigenvectors
Reconstruction = Projection × transpose of top K eigenvectors

Independently discovered by Pearson in 1901 and Hotelling in 1933.

1. $\Sigma = \operatorname{cov}(X)$
2. $W = \operatorname{eigs}(\Sigma, K)$
3. $Y = (X - \mu)\, W$
-
RECONSTRUCTION
4. $\hat{X} = Y W^\top + \mu$
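Continuing the pca() sketch from earlier (a hypothetical helper, not course code), step 4 is a one-liner:

```python
# Reconstruct from the K-dimensional projections: X-hat = Y W^T + mu.
Y, W, mu = pca(X, K=2)
Xhat = Y @ W.T + mu
```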
-
WHEN d >> n
If $d \gg n$, then $\Sigma$ is a large $d \times d$ matrix, but we only need its top K eigenvectors. Idea: use the SVD.

Write $X - \mu = U D V^\top$ (columns of $X$ are the data points here). Then note that

$$\Sigma = \frac{1}{n} (X - \mu)(X - \mu)^\top = U \,\frac{D^2}{n}\, U^\top$$

Hence the matrix $U$ is the same as the matrix $W$ obtained from the eigendecomposition of $\Sigma$, and the eigenvalues are the diagonal elements of $D^2$ (divided by $n$).

Alternative algorithm: $W = \operatorname{SVD}(X - \mu, K)$
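A sketch of the SVD route in NumPy (names are mine). With rows as data points, the principal directions come out as the right singular vectors; under the slide's column convention they are the columns of $U$:

```python
import numpy as np

def pca_via_svd(X, K):
    """PCA basis from a thin SVD of the centered data.

    Avoids forming the d x d covariance matrix when d >> n.
    X: (n, d) array, one data point per row.
    """
    mu = X.mean(axis=0)
    # full_matrices=False gives the thin SVD: U (n, r), s (r,), Vt (r, d).
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:K].T                       # top-K principal directions (d, K)
    eigvals = s[:K] ** 2 / len(X)      # eigenvalues of Sigma = squared singular values / n
    return (X - mu) @ W, W, eigvals
```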
-
PRINCIPAL COMPONENT ANALYSIS: DEMO