
  • Machine Learning for Data Science (CS4786), Lecture 3

    Principal Component Analysis

    Course Webpage: http://www.cs.cornell.edu/Courses/cs4786/2016sp/

  • ANNOUNCEMENTS

    Waitlist size currently about 55 :(

  • DIMENSIONALITY REDUCTION

    Given feature vectors x_1, . . . , x_n ∈ R^d, compress the data points into a low-dimensional representation y_1, . . . , y_n ∈ R^K, where K ≪ d.

    [Figure: the data matrix with rows x_1^T, . . . , x_n^T is mapped to a much narrower matrix with rows y_1^T, . . . , y_n^T.]

  • PCA: VARIANCE MAXIMIZATION

    Pick directions along which the data varies the most. First principal component:

    w_1 = argmax_{w : ||w||_2 = 1} (1/n) ∑_{t=1}^n ( w^T x_t − (1/n) ∑_{s=1}^n w^T x_s )^2
        = argmax_{w : ||w||_2 = 1} (1/n) ∑_{t=1}^n ( w^T (x_t − µ) )^2
        = argmax_{w : ||w||_2 = 1} w^T Σ w

    Σ is the covariance matrix.

    Writing down the Lagrangian and optimizing: Σ w_1 = λ w_1, and the variance captured is (1/n) ∑_{t=1}^n ( w_1^T (x_t − µ) )^2 = w_1^T Σ w_1 = λ.

    [Figure: scatter plot of 2D data points with the first principal direction drawn through them.]

    First principal direction = top eigenvector
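    As a quick numerical sanity check of this claim, a minimal NumPy sketch (assuming synthetic 2D data with data points as rows): the variance of the projections w^T(x_t − µ) is largest along the top eigenvector of Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]], size=500)  # n x d
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)          # (1/n) sum_t (x_t - mu)(x_t - mu)^T

evals, evecs = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
w1 = evecs[:, -1]                               # top eigenvector

def variance_along(w):
    """(1/n) sum_t (w^T (x_t - mu))^2 for a unit vector w."""
    return np.mean(((X - mu) @ w) ** 2)

# Variance along w1 equals the top eigenvalue, and beats random unit directions.
print(variance_along(w1), evals[-1])
for _ in range(3):
    w = rng.normal(size=2); w /= np.linalg.norm(w)
    assert variance_along(w) <= variance_along(w1) + 1e-12
```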

  • PRINCIPAL COMPONENT ANALYSIS

    Eigenvectors of the covariance matrix are the principal components.

    The top K principal components are the eigenvectors with the K largest eigenvalues.

    Projection = data × top K eigenvectors. Reconstruction = projection × transpose of the top K eigenvectors.

    Independently discovered by Pearson in 1901 and Hotelling in 1933.

    1. Σ = cov(X)
    2. W = eigs(Σ, K)
    3. Y = (X − µ) W   (rows of X are the data points)
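    A minimal sketch of these three steps in NumPy (assuming X is an n × d array with data points as rows; names are illustrative):

```python
import numpy as np

def pca(X, K):
    mu = X.mean(axis=0)                      # feature means
    Xc = X - mu                              # center the data
    Sigma = Xc.T @ Xc / len(X)               # step 1: covariance matrix (d x d)
    evals, evecs = np.linalg.eigh(Sigma)     # eigenpairs, ascending order
    W = evecs[:, ::-1][:, :K]                # step 2: top-K eigenvectors (d x K)
    Y = Xc @ W                               # step 3: K-dimensional projections
    return Y, W, mu

# usage: compress 10-dimensional points down to K = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
Y, W, mu = pca(X, K=2)
print(Y.shape, W.shape)                      # (200, 2) (10, 2)
```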

  • A PICTURE

    [Figure: a data point x_t and a unit direction w; the coefficient of x_t along w is its projection length.]

    y_t[1] = x_t^T w = ||x_t|| cos(∠(x_t, w))
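    A small numeric check of this identity (hypothetical numbers): for a unit vector w, the dot product equals the norm of x times the cosine of the angle between them.

```python
import numpy as np

x = np.array([2.0, 1.0])
w = np.array([1.0, 1.0]) / np.sqrt(2)                     # unit direction
cos_angle = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))
print(x @ w, np.linalg.norm(x) * cos_angle)               # both ~2.1213
```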

  • ORTHONORMAL PROJECTIONS

    Think of w_1, . . . , w_K as a coordinate system for PCA.

    The y values provide the coefficients in this system.

    Without loss of generality, w_1, . . . , w_K can be taken orthonormal, i.e. w_i ⊥ w_j and ||w_i|| = 1.

    Reconstruction: x̂_t = ∑_{j=1}^K y_t[j] w_j + µ
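    A minimal sketch of this, assuming the rows-as-points convention above: the eigenvector columns of W are orthonormal, and x̂_t = ∑_j y_t[j] w_j + µ becomes X̂ = Y W^T + µ in matrix form.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)
evals, evecs = np.linalg.eigh(Sigma)
W = evecs[:, ::-1][:, :2]                       # top-2 directions, d x K

print(np.allclose(W.T @ W, np.eye(2)))          # orthonormal: True
Y = (X - mu) @ W                                # coefficients in the W basis
X_hat = Y @ W.T + mu                            # reconstruction
print(np.linalg.norm(X_hat - X))                # reconstruction error
```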

  • PCA: VARIANCE MAXIMIZATION

    How do we find the remaining components? We are looking for orthogonal directions.

    Start with the d-dimensional space.
    While we haven't yet found K directions:
        Find the first principal component direction in the current subspace.
        Remove this direction: project the data points onto the remaining subspace (the orthogonal complement of the direction just found).
    End

    This solution is given by W = top K eigenvectors of Σ (a sketch of the greedy procedure follows below).
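    A minimal sketch of that greedy procedure (one reading of the loop: each round takes the top eigenvector of the covariance of the current data, then projects the data onto that direction's orthogonal complement); it recovers the same top-K eigenvectors of Σ up to sign.

```python
import numpy as np

def greedy_pca_directions(X, K):
    Xc = X - X.mean(axis=0)                       # centered data, n x d
    directions = []
    for _ in range(K):
        Sigma = Xc.T @ Xc / len(Xc)               # covariance of current data
        evals, evecs = np.linalg.eigh(Sigma)
        w = evecs[:, -1]                          # first principal direction
        directions.append(w)
        Xc = Xc - np.outer(Xc @ w, w)             # remove that direction
    return np.column_stack(directions)            # d x K

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
W_greedy = greedy_pca_directions(X, K=3)

# Agrees (up to sign) with the top-K eigenvectors of the full covariance.
Sigma = (X - X.mean(0)).T @ (X - X.mean(0)) / len(X)
W_eig = np.linalg.eigh(Sigma)[1][:, ::-1][:, :3]
print(np.allclose(np.abs(W_greedy.T @ W_eig), np.eye(3), atol=1e-6))
```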


  • PCA: VARIANCE MAXIMIZATION

    Covariance matrix:

    Σ = (1/n) ∑_{t=1}^n (x_t − µ)(x_t − µ)^T

    It is a d × d matrix; Σ[i, j] measures the "covariance" of features i and j.

    Recall cov(A, B) = E[(A − E[A])(B − E[B])]. Alternatively,

    Σ[i, j] = (1/n) · ( x_1[i] − µ[i], . . . , x_n[i] − µ[i] )^T ( x_1[j] − µ[j], . . . , x_n[j] − µ[j] )

    Inner products measure similarity.
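    A small check of the entrywise formula on synthetic data: Σ[i, j] is (1/n) times the inner product of the centered i-th and j-th feature columns.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))                  # n = 50 points, d = 4 features
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)

i, j = 1, 3
entry = (X[:, i] - mu[i]) @ (X[:, j] - mu[j]) / len(X)
print(np.isclose(entry, Sigma[i, j]))         # True
```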

  • PCA: MINIMIZING RECONSTRUCTION ERROR

    Goal: find the basis that minimizes the reconstruction error,

    ∑_{t=1}^n ||x̂_t − x_t||_2^2
        = ∑_{t=1}^n || ∑_{j=1}^k y_t[j] w_j + µ − x_t ||_2^2
        = ∑_{t=1}^n || ∑_{j=1}^k y_t[j] w_j + µ − ∑_{j=1}^d y_t[j] w_j − µ ||_2^2      (expanding x_t in the complete orthonormal basis w_1, . . . , w_d)
        = ∑_{t=1}^n || ∑_{j=k+1}^d y_t[j] w_j ||_2^2                                   (note that y_t[j] = w_j^T (x_t − µ))
        = ∑_{t=1}^n || ∑_{j=k+1}^d (w_j^T (x_t − µ)) w_j ||_2^2
        = ∑_{t=1}^n ∑_{j=k+1}^d ( w_j^T (x_t − µ) )^2                                  (cross terms vanish since the w_j are orthonormal)
        = ∑_{t=1}^n ∑_{j=k+1}^d w_j^T (x_t − µ)(x_t − µ)^T w_j
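    A numeric check of the final identity above, on synthetic data: the total reconstruction error with k directions equals the squared coefficients along the discarded directions, ∑_t ∑_{j>k} (w_j^T (x_t − µ))^2.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6)); mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)
W_all = np.linalg.eigh(Sigma)[1][:, ::-1]       # all d eigenvectors, largest first

k = 2
W = W_all[:, :k]
X_hat = (X - mu) @ W @ W.T + mu                 # rank-k reconstruction
lhs = np.sum((X_hat - X) ** 2)
rhs = np.sum(((X - mu) @ W_all[:, k:]) ** 2)    # discarded coefficients squared
print(np.isclose(lhs, rhs))                     # True
```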


  • PCA: MINIMIZING RECONSTRUCTION ERROR

    Goal: find the basis that minimizes the reconstruction error,

    (1/n) ∑_{t=1}^n ||x̂_t − x_t||_2^2 = (1/n) ∑_{t=1}^n ∑_{j=k+1}^d w_j^T (x_t − µ)(x_t − µ)^T w_j = ∑_{j=k+1}^d w_j^T Σ w_j

    Minimize over w's that are orthonormal:

    argmin_{∀j, ||w_j||_2 = 1} ∑_{j=k+1}^d w_j^T Σ w_j

    Using Lagrange multipliers, there exist λ_{k+1}, . . . , λ_d such that the solution to the above is given by minimizing

    ∑_{j=k+1}^d w_j^T Σ w_j + ∑_{j=k+1}^d λ_j (1 − ||w_j||_2^2)

    Setting the derivative to 0: Σ w_j = λ_j w_j. That is, the w_j's are eigenvectors of Σ and the λ_j's are eigenvalues.


  • PCA: MINIMIZING RECONSTRUCTION ERROR

    Solution: the w_j's are eigenvectors of Σ and the λ_j's are the corresponding eigenvalues. Further, the reconstruction error can be written as:

    min_{∀j, ||w_j||_2 = 1} ∑_{j=k+1}^d w_j^T Σ w_j = ∑_{j=k+1}^d λ_j w_j^T w_j = ∑_{j=k+1}^d λ_j

    To minimize the reconstruction error we therefore need to minimize ∑_{j=k+1}^d λ_j. In other words, we keep the top k directions and discard the d − k directions with the smallest eigenvalues.
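    A numeric check of this conclusion on synthetic data: the average reconstruction error with the top k eigenvectors equals the sum of the d − k smallest eigenvalues of Σ.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 8)) @ rng.normal(size=(8, 8)); mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)
evals, evecs = np.linalg.eigh(Sigma)            # ascending eigenvalues

k = 3
W = evecs[:, ::-1][:, :k]                       # top-k eigenvectors
X_hat = (X - mu) @ W @ W.T + mu
avg_err = np.mean(np.sum((X_hat - X) ** 2, axis=1))
print(np.isclose(avg_err, evals[:-k].sum()))    # sum of the d-k smallest eigenvalues
```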

  • PRINCIPAL COMPONENT ANALYSIS

    Eigenvectors of the covariance matrix are the principal components.

    The top K principal components are the eigenvectors with the K largest eigenvalues.

    Projection = data × top K eigenvectors. Reconstruction = projection × transpose of the top K eigenvectors.

    Independently discovered by Pearson in 1901 and Hotelling in 1933.

    1. Σ = cov(X)
    2. W = eigs(Σ, K)
    3. Y = (X − µ) W

  • RECONSTRUCTION

    4. X̂ = Y W^T + µ

  • WHEN d >> n

    If d >> n, then Σ is a large d × d matrix. But we only need the top K eigenvectors. Idea: use the SVD.

    Write X − µ = U D V^T (here the centered data points are the columns of X − µ). Then note that

    Σ = (X − µ)(X − µ)^T = U D^2 U^T     (up to the 1/n factor in the definition of Σ)

    Hence the matrix U is the same as the matrix W obtained from the eigendecomposition of Σ, and the eigenvalues are the diagonal elements of D^2 (divided by n).

    Alternative algorithm:

    W = SVD(X − µ, K)
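    A minimal sketch of the SVD route, with one adjustment: as in the earlier steps, X below is an n × d array with data points as rows (the slide writes the decomposition with points as columns), so the right singular vectors play the role of W and the eigenvalues of Σ are the diagonal of D^2 / n.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, K = 20, 500, 3                           # d >> n
X = rng.normal(size=(n, d))
mu = X.mean(axis=0)

U, D, Vt = np.linalg.svd(X - mu, full_matrices=False)   # thin SVD, no d x d matrix
W_svd = Vt[:K].T                               # top-K principal directions, d x K
evals_svd = D[:K] ** 2 / n                     # top-K eigenvalues of Sigma

# Same answer (up to sign) as the eigendecomposition of the d x d covariance.
Sigma = (X - mu).T @ (X - mu) / n
evals, evecs = np.linalg.eigh(Sigma)
W_eig = evecs[:, ::-1][:, :K]
print(np.allclose(np.abs(W_svd.T @ W_eig), np.eye(K), atol=1e-8))
print(np.allclose(evals_svd, evals[::-1][:K]))
```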

  • PRINCIPAL COMPONENT ANALYSIS: DEMO