
  • Machine Learning for Data Science (CS4786), Lecture 3

    Principal Component Analysis

    Course Webpage: http://www.cs.cornell.edu/Courses/cs4786/2016sp/

  • ANNOUNCEMENTS

    Waitlist size currently about 55 :(

  • DIMENSIONALITY REDUCTION

    Given feature vectors x_1, . . . , x_n ∈ R^d, compress the data points into a low-dimensional representation y_1, . . . , y_n ∈ R^K, where K ≪ d.

    [Figure: the data matrix with rows x_1^T, . . . , x_n^T is mapped to a much narrower matrix with rows y_1^T, . . . , y_n^T.]

  • PCA: VARIANCE MAXIMIZATION

    Pick directions along which the data varies the most. First principal component:

    w_1 = argmax_{w : ||w||_2 = 1} (1/n) ∑_{t=1}^n ( w^T x_t − (1/n) ∑_{s=1}^n w^T x_s )^2
        = argmax_{w : ||w||_2 = 1} (1/n) ∑_{t=1}^n ( w^T (x_t − µ) )^2
        = argmax_{w : ||w||_2 = 1} w^T Σ w

    Σ is the covariance matrix.

    Writing down the Lagrangian and optimizing: Σ w_1 = λ w_1, and the variance captured is (1/n) ∑_{t=1}^n ( w_1^T (x_t − µ) )^2 = w_1^T Σ w_1 = λ.

    [Figure: scatter plot of 2D data points with the first principal direction drawn through them.]

    First principal direction = top eigenvector
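    As a quick numerical sanity check of this claim, a minimal NumPy sketch (assuming synthetic 2D data with data points as rows): the variance of the projections w^T(x_t − µ) is largest along the top eigenvector of Σ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]], size=500)  # n x d
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)          # (1/n) sum_t (x_t - mu)(x_t - mu)^T

evals, evecs = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
w1 = evecs[:, -1]                               # top eigenvector

def variance_along(w):
    """(1/n) sum_t (w^T (x_t - mu))^2 for a unit vector w."""
    return np.mean(((X - mu) @ w) ** 2)

# Variance along w1 equals the top eigenvalue, and beats random unit directions.
print(variance_along(w1), evals[-1])
for _ in range(3):
    w = rng.normal(size=2); w /= np.linalg.norm(w)
    assert variance_along(w) <= variance_along(w1) + 1e-12
```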

  • PRINCIPAL COMPONENT ANALYSIS

    Eigenvectors of the covariance matrix are the principal components.

    The top K principal components are the eigenvectors with the K largest eigenvalues.

    Projection = data × top K eigenvectors. Reconstruction = projection × transpose of the top K eigenvectors.

    Independently discovered by Pearson in 1901 and Hotelling in 1933.

    1. Σ = cov(X)
    2. W = eigs(Σ, K)
    3. Y = (X − µ) W   (rows of X are the data points)
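    A minimal sketch of these three steps in NumPy (assuming X is an n × d array with data points as rows; names are illustrative):

```python
import numpy as np

def pca(X, K):
    mu = X.mean(axis=0)                      # feature means
    Xc = X - mu                              # center the data
    Sigma = Xc.T @ Xc / len(X)               # step 1: covariance matrix (d x d)
    evals, evecs = np.linalg.eigh(Sigma)     # eigenpairs, ascending order
    W = evecs[:, ::-1][:, :K]                # step 2: top-K eigenvectors (d x K)
    Y = Xc @ W                               # step 3: K-dimensional projections
    return Y, W, mu

# usage: compress 10-dimensional points down to K = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
Y, W, mu = pca(X, K=2)
print(Y.shape, W.shape)                      # (200, 2) (10, 2)
```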

  • A PICTURE

    [Figure: a data point x_t and a unit direction w; the coefficient of x_t along w is its projection length.]

    y_t[1] = x_t^T w = ||x_t|| cos(∠(x_t, w))
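    A small numeric check of this identity (hypothetical numbers): for a unit vector w, the dot product equals the norm of x times the cosine of the angle between them.

```python
import numpy as np

x = np.array([2.0, 1.0])
w = np.array([1.0, 1.0]) / np.sqrt(2)                     # unit direction
cos_angle = x @ w / (np.linalg.norm(x) * np.linalg.norm(w))
print(x @ w, np.linalg.norm(x) * cos_angle)               # both ~2.1213
```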

  • ORTHONORMAL PROJECTIONS

    Think of w_1, . . . , w_K as a coordinate system for PCA.

    The y values provide the coefficients in this system.

    Without loss of generality, w_1, . . . , w_K can be taken orthonormal, i.e. w_i ⊥ w_j and ||w_i|| = 1.

    Reconstruction: x̂_t = ∑_{j=1}^K y_t[j] w_j + µ
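    A minimal sketch of this, assuming the rows-as-points convention above: the eigenvector columns of W are orthonormal, and x̂_t = ∑_j y_t[j] w_j + µ becomes X̂ = Y W^T + µ in matrix form.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)
evals, evecs = np.linalg.eigh(Sigma)
W = evecs[:, ::-1][:, :2]                       # top-2 directions, d x K

print(np.allclose(W.T @ W, np.eye(2)))          # orthonormal: True
Y = (X - mu) @ W                                # coefficients in the W basis
X_hat = Y @ W.T + mu                            # reconstruction
print(np.linalg.norm(X_hat - X))                # reconstruction error
```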

  • PCA: VARIANCE MAXIMIZATION

    How do we find the remaining components? We are looking for orthogonal directions.

    Start with the d-dimensional space.
    While we haven't yet found K directions:
        Find the first principal component direction in the current subspace.
        Remove this direction: project the data points onto the remaining subspace (the orthogonal complement of the direction just found).
    End

    This solution is given by W = top K eigenvectors of Σ (a sketch of the greedy procedure follows below).
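    A minimal sketch of that greedy procedure (one reading of the loop: each round takes the top eigenvector of the covariance of the current data, then projects the data onto that direction's orthogonal complement); it recovers the same top-K eigenvectors of Σ up to sign.

```python
import numpy as np

def greedy_pca_directions(X, K):
    Xc = X - X.mean(axis=0)                       # centered data, n x d
    directions = []
    for _ in range(K):
        Sigma = Xc.T @ Xc / len(Xc)               # covariance of current data
        evals, evecs = np.linalg.eigh(Sigma)
        w = evecs[:, -1]                          # first principal direction
        directions.append(w)
        Xc = Xc - np.outer(Xc @ w, w)             # remove that direction
    return np.column_stack(directions)            # d x K

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
W_greedy = greedy_pca_directions(X, K=3)

# Agrees (up to sign) with the top-K eigenvectors of the full covariance.
Sigma = (X - X.mean(0)).T @ (X - X.mean(0)) / len(X)
W_eig = np.linalg.eigh(Sigma)[1][:, ::-1][:, :3]
print(np.allclose(np.abs(W_greedy.T @ W_eig), np.eye(3), atol=1e-6))
```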


  • PCA: VARIANCE MAXIMIZATION

    Covariance matrix:

    Σ = (1/n) ∑_{t=1}^n (x_t − µ)(x_t − µ)^T

    It is a d × d matrix; Σ[i, j] measures the "covariance" of features i and j.

    Recall cov(A, B) = E[(A − E[A])(B − E[B])]. Alternatively,

    Σ[i, j] = (1/n) · ( x_1[i] − µ[i], . . . , x_n[i] − µ[i] )^T ( x_1[j] − µ[j], . . . , x_n[j] − µ[j] )

    Inner products measure similarity.
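    A small check of the entrywise formula on synthetic data: Σ[i, j] is (1/n) times the inner product of the centered i-th and j-th feature columns.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))                  # n = 50 points, d = 4 features
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)

i, j = 1, 3
entry = (X[:, i] - mu[i]) @ (X[:, j] - mu[j]) / len(X)
print(np.isclose(entry, Sigma[i, j]))         # True
```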

  • PCA: MINIMIZING RECONSTRUCTION ERROR

    Goal: find the basis that minimizes the reconstruction error,

    ∑_{t=1}^n ||x̂_t − x_t||_2^2
        = ∑_{t=1}^n || ∑_{j=1}^k y_t[j] w_j + µ − x_t ||_2^2
        = ∑_{t=1}^n || ∑_{j=1}^k y_t[j] w_j + µ − ∑_{j=1}^d y_t[j] w_j − µ ||_2^2      (expanding x_t in the complete orthonormal basis w_1, . . . , w_d)
        = ∑_{t=1}^n || ∑_{j=k+1}^d y_t[j] w_j ||_2^2                                   (note that y_t[j] = w_j^T (x_t − µ))
        = ∑_{t=1}^n || ∑_{j=k+1}^d (w_j^T (x_t − µ)) w_j ||_2^2
        = ∑_{t=1}^n ∑_{j=k+1}^d ( w_j^T (x_t − µ) )^2                                  (cross terms vanish since the w_j are orthonormal)
        = ∑_{t=1}^n ∑_{j=k+1}^d w_j^T (x_t − µ)(x_t − µ)^T w_j
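    A numeric check of the final identity above, on synthetic data: the total reconstruction error with k directions equals the squared coefficients along the discarded directions, ∑_t ∑_{j>k} (w_j^T (x_t − µ))^2.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6)); mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)
W_all = np.linalg.eigh(Sigma)[1][:, ::-1]       # all d eigenvectors, largest first

k = 2
W = W_all[:, :k]
X_hat = (X - mu) @ W @ W.T + mu                 # rank-k reconstruction
lhs = np.sum((X_hat - X) ** 2)
rhs = np.sum(((X - mu) @ W_all[:, k:]) ** 2)    # discarded coefficients squared
print(np.isclose(lhs, rhs))                     # True
```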


  • PCA: MINIMIZING RECONSTRUCTION ERROR

    Goal: find the basis that minimizes the reconstruction error,

    (1/n) ∑_{t=1}^n ||x̂_t − x_t||_2^2 = (1/n) ∑_{t=1}^n ∑_{j=k+1}^d w_j^T (x_t − µ)(x_t − µ)^T w_j = ∑_{j=k+1}^d w_j^T Σ w_j

    Minimize over w's that are orthonormal:

    argmin_{∀j, ||w_j||_2 = 1} ∑_{j=k+1}^d w_j^T Σ w_j

    Using Lagrange multipliers, there exist λ_{k+1}, . . . , λ_d such that the solution to the above is given by minimizing

    ∑_{j=k+1}^d w_j^T Σ w_j + ∑_{j=k+1}^d λ_j (1 − ||w_j||_2^2)

    Setting the derivative to 0: Σ w_j = λ_j w_j. That is, the w_j's are eigenvectors of Σ and the λ_j's are eigenvalues.


  • PCA: MINIMIZING RECONSTRUCTION ERROR

    Solution: the w_j's are eigenvectors of Σ and the λ_j's are the corresponding eigenvalues. Further, the reconstruction error can be written as:

    min_{∀j, ||w_j||_2 = 1} ∑_{j=k+1}^d w_j^T Σ w_j = ∑_{j=k+1}^d λ_j w_j^T w_j = ∑_{j=k+1}^d λ_j

    To minimize the reconstruction error we therefore need to minimize ∑_{j=k+1}^d λ_j. In other words, we keep the top k directions and discard the d − k directions with the smallest eigenvalues.
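    A numeric check of this conclusion on synthetic data: the average reconstruction error with the top k eigenvectors equals the sum of the d − k smallest eigenvalues of Σ.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 8)) @ rng.normal(size=(8, 8)); mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)
evals, evecs = np.linalg.eigh(Sigma)            # ascending eigenvalues

k = 3
W = evecs[:, ::-1][:, :k]                       # top-k eigenvectors
X_hat = (X - mu) @ W @ W.T + mu
avg_err = np.mean(np.sum((X_hat - X) ** 2, axis=1))
print(np.isclose(avg_err, evals[:-k].sum()))    # sum of the d-k smallest eigenvalues
```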

  • PRINCIPAL COMPONENT ANALYSIS

    Eigenvectors of the covariance matrix are the principal components.

    The top K principal components are the eigenvectors with the K largest eigenvalues.

    Projection = data × top K eigenvectors. Reconstruction = projection × transpose of the top K eigenvectors.

    Independently discovered by Pearson in 1901 and Hotelling in 1933.

    1. Σ = cov(X)
    2. W = eigs(Σ, K)
    3. Y = (X − µ) W

  • RECONSTRUCTION

    4. X̂ = Y W^T + µ

  • WHEN d >> n

    If d >> n, then Σ is a large d × d matrix. But we only need the top K eigenvectors. Idea: use the SVD.

    Write X − µ = U D V^T (here the centered data points are the columns of X − µ). Then note that

    Σ = (X − µ)(X − µ)^T = U D^2 U^T     (up to the 1/n factor in the definition of Σ)

    Hence the matrix U is the same as the matrix W obtained from the eigendecomposition of Σ, and the eigenvalues are the diagonal elements of D^2 (divided by n).

    Alternative algorithm:

    W = SVD(X − µ, K)
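    A minimal sketch of the SVD route, with one adjustment: as in the earlier steps, X below is an n × d array with data points as rows (the slide writes the decomposition with points as columns), so the right singular vectors play the role of W and the eigenvalues of Σ are the diagonal of D^2 / n.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, K = 20, 500, 3                           # d >> n
X = rng.normal(size=(n, d))
mu = X.mean(axis=0)

U, D, Vt = np.linalg.svd(X - mu, full_matrices=False)   # thin SVD, no d x d matrix
W_svd = Vt[:K].T                               # top-K principal directions, d x K
evals_svd = D[:K] ** 2 / n                     # top-K eigenvalues of Sigma

# Same answer (up to sign) as the eigendecomposition of the d x d covariance.
Sigma = (X - mu).T @ (X - mu) / n
evals, evecs = np.linalg.eigh(Sigma)
W_eig = evecs[:, ::-1][:, :K]
print(np.allclose(np.abs(W_svd.T @ W_eig), np.eye(K), atol=1e-8))
print(np.allclose(evals_svd, evals[::-1][:K]))
```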

  • PRINCIPAL COMPONENT ANALYSIS: DEMO