LazyProgrammer.me Principal Components Analysis Tutorial (PCA)

DESCRIPTION

Tutorial on principal components analysis (PCA). Requires knowledge of statistics and linear algebra, in particular, knowing how to calculate the covariance matrix and what eigenvalues and eigenvectors are.


Tutorial: Principal Components Analysis (PCA) (http://lazyprogrammer.me/2015/11/20/tutorial-principal-components-analysis-pca/)

I remember learning about principal components analysis for the very first time. I remember thinking it was very confusing, and that I didn't know what it had to do with eigenvalues and eigenvectors (I'm not even sure I remembered what eigenvalues and eigenvectors were at the time).

I assure you that in hindsight, PCA, despite its very scientific-sounding name, is not that difficult to understand at all. There are two "categories" I think it makes sense to split the discussion of PCA into.

1) What does it do? A written and visual description of how PCA is used.

2) The math behind PCA. A few equations showing that the things mentioned in (1) are true.

What does PCA do?

Linear transformation

PCA finds a matrix Q that, when multiplied by your original data matrix X, gives you a linearly transformed data matrix Z, where:

Z = XQ

In other words, for a single sample vector x, we can obtain its transformation z = Qx.

The Q and x appear in a different order here because when we load a data matrix, it's usually N × D, so each sample vector is 1 × D, but when we talk about vectors by themselves we usually think of them as D × 1.

The interesting thing about PCA is how it chooses Q.
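To make the shape conventions concrete, here is a minimal numpy sketch (my own addition, not from the original post); the Q below is just a random placeholder, not the special matrix PCA will actually choose:

import numpy as np

N, D = 5, 3                   # 5 samples, 3 dimensions
X = np.random.randn(N, D)     # data matrix, one sample per row (N x D)
Q = np.random.randn(D, D)     # placeholder transformation; PCA will choose a special Q

Z = X.dot(Q)                  # transformed data matrix, still N x D
x = X[0]                      # a single sample, as a length-D row
z = x.dot(Q)                  # row-vector convention: z = xQ

print(Z.shape)                # (5, 3)
print(np.allclose(z, Z[0]))   # True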

Dimensionality reduction



When we think of machine learning models we often study them in the context of 1-3 dimensions. But if you take a dataset like MNIST, where each image is 28×28 pixels, then each input is 784 dimensions. 28×28 is a tiny image. The new iPhone 6 has a resolution of 1080 × 1920 pixels. Each image would thus have 2073600 ≈ 2 million dimensions.

PCA reduces dimensionality by moving as much "information" as possible into as few dimensions as possible. We measure information via unpredictability, i.e. variance. The end result is that our transformed matrix Z has most of its variance in the first column, less variance in the second column, even less variance in the third column, etc.

You will see later how our choice of Q accomplishes this.
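As a quick, hedged illustration of this variance concentration (my own sketch; it uses scikit-learn's 64-dimensional digits dataset as a small stand-in for MNIST):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                      # 1797 samples, 64 dimensions (8x8 images)
pca = PCA()
Z = pca.fit_transform(X)

# proportion of total variance carried by each column of Z, in decreasing order
print(np.around(pca.explained_variance_ratio_[:5], 3))

# cumulative variance captured by the first k columns
print(np.around(np.cumsum(pca.explained_variance_ratio_)[:5], 3))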

De-correlation

D ≈ 2000000 is a much bigger problem than the ones you learn about in the classroom. But often there are correlations in the data that make many of the dimensions redundant. This is related to the problem of linear regression (http://gum.co/datasciencelinearregression) – if we can accurately predict one dimension of the inputs, let's say x_3, in terms of another, x_5, then one of those dimensions is redundant.

The final result of removing these correlations is that every dimension of Z is uncorrelated with every other dimension.

(http://lazyprogrammer.me/wp-content/uploads/2015/11/PCA.jpg)

The image above shows a scatterplot of the original data on the original x-y axis. The arrows on the plot itself show that the new v1-v2 axis is a) rotated (matrix multiplication rotates a vector), b) still perpendicular, and c) the data relative to the new axis is normally distributed and the distributions on each dimension are independent and uncorrelated.

Visualization

Once we've transformed X into Z, noting that most of the "information" from X is in the first few dimensions of Z – let's say… 90% of the variance is in the first 2 dimensions – we can do a 2-D scatterplot of Z to see the structure of our data. Humans have a hard time visualizing anything greater than 2 dimensions.

We can consider the first 2 dimensions to represent the real underlying trend in the data, and the other dimensions just small perturbations or noise.
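A hedged sketch of such a 2-D scatterplot (again my own addition, using the digits dataset as a stand-in for real data):

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
Z = PCA().fit_transform(digits.data)

# scatterplot of only the first 2 (highest-variance) dimensions of Z,
# colored by the digit label to show the structure PCA preserves
plt.scatter(Z[:, 0], Z[:, 1], c=digits.target, s=10)
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()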

Pre-processing

You may have heard that too many parameters in your model can lead to overfitting. So if you are planning to use a logistic regression (https://gum.co/logistic-regression-data-science-python) model, and the dimensionality of each input is 2 million, then the number of weights in your model will also be 2 million. This is not good news if you have only a few thousand samples. One rule of thumb is we would like to have 10x the number of samples compared to the number of parameters in our model.

So instead of going out and finding 20 million samples, we can use PCA to reduce the dimensionality of our data to, say, 20, and then we only need 200 samples for our model.

You can also use PCA to pre-process data before using an unsupervised learning algorithm, like k-means clustering. PCA, by the way, is also an unsupervised algorithm.
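A sketch of what this kind of pre-processing might look like with scikit-learn's Pipeline (my own illustration; the digits dataset, the choice of 20 components, and the default classifier settings are all assumptions, not from the original post):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

digits = load_digits()

# reduce 64 input dimensions to 20 before fitting the classifier,
# so the logistic regression only has to learn weights for 20 features
model = Pipeline([
    ('pca', PCA(n_components=20)),
    ('logreg', LogisticRegression()),
])
model.fit(digits.data, digits.target)
print(model.score(digits.data, digits.target))   # training accuracy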

The math behind PCA

Everybody's favorite part!

Ok, so this requires both (a) some statistics knowledge (knowing how to find the covariance matrix), and (b) some linear algebra knowledge (knowing what eigenvalues and eigenvectors are, and basic matrix manipulation).

Covariance

I stated above that in the rotated v1-v2 coordinate system, the data on the v1 axis was not correlated with the data on the v2 axis.



Intuitively, when you have a 2-D Gaussian distribution, the data looks like an ellipse. If that ellipse is aligned with the x-y grid (its axes parallel to the x and y axes), then the x and y components are independent and uncorrelated.

The image below shows what happens with different values of the correlation between X_i and X_j (called 'c' in the image):

(http://lazyprogrammer.me/wp-content/uploads/2015/11/ellipses.gif)

(If you really know your statistics, then you will recall that independence implies 0 correlation, but 0 correlation does not imply independence, unless the distribution is a joint Gaussian.)

So, in the first image, since the ellipse is not aligned with the x-y axis, the distributions p(x) and p(y) are not independent. But in the rotated v1-v2 axis, the distributions p(v1) and p(v2) are independent and uncorrelated.

If this all sounds crazy, just remember this one fact: if 2 variables X_i and X_j are uncorrelated, their corresponding element in the covariance matrix, Σ_ij = 0. And since the covariance matrix is symmetric (for our purposes), Σ_ji = 0 also.

For every variable to be uncorrelated to every other variable, all the elements of Σ will be 0, except those along the diagonal, Σ_ii = σ_i², which is just the regular variance for each dimension.

How do we calculate the covariance matrix from the data? Simple!

\Sigma_X = \frac{1}{N} (X - \mu_X)^T (X - \mu_X) \qquad (1)

In non-matrix form this is:

\Sigma_{ij} = \frac{1}{N} \sum_{n=1}^{N} (x_{ni} - \mu_i)(x_{nj} - \mu_j)

where x_{ni} is the i-th component of the n-th sample.

I've labeled the equation (1) because we're going to return to it. You can verify that Σ_X is D × D since those are the outer dimensions of the matrix multiply.

Note the mathematical sleight of hand I used above. X is N × D and the mean μ_X is 1 × D, so technically you can't subtract them under the rules of matrix algebra, but let's assume that the above notation means each row of X has μ_X subtracted from it.
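Here is a small numpy sketch of equation (1) on random data (my own addition), including the row-wise mean subtraction described above:

import numpy as np

N, D = 100, 5
X = np.random.randn(N, D)

mu = X.mean(axis=0)          # 1 x D vector of column means
Xc = X - mu                  # numpy broadcasts mu across the rows of X
Sigma = Xc.T.dot(Xc) / N     # equation (1): a D x D covariance matrix

# np.cov normalizes by N-1 by default; bias=True switches to 1/N to match equation (1)
print(np.allclose(Sigma, np.cov(X.T, bias=True)))   # True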

Eigenvalues and Eigenvectors

This is where the real magic of PCA happens.

For no particular reason at all, suppose we compute the eigenvalues and eigenvectors of Σ_X.

If you don't know what eigenvalues/eigenvectors are: remember that usually, when we multiply a vector by a matrix, it changes the direction of the vector.

So for example, the vector (1, 2) is pointing in the same direction as (2, 4) = 2(1, 2). But if we were to multiply (1, 2) by a matrix:

\begin{pmatrix} 5 \\ 11 \end{pmatrix} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix}

The result is a vector in a different direction. Eigenvalues and eigenvectors have the property that, if you multiply Σ_X v, this is the same as multiplying the eigenvector by a constant – the eigenvalue. That is, the eigenvector does not change direction, it only gets shorter or longer by a factor of λ. In other words:

\Sigma_X v = \lambda v
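If you want to check this property numerically, here is a hedged sketch using numpy (not part of the original post); np.linalg.eigh is used because the covariance matrix is symmetric:

import numpy as np

# build a covariance matrix from some random data
X = np.random.randn(200, 4)
Xc = X - X.mean(axis=0)
Sigma = Xc.T.dot(Xc) / len(X)

# eigh is meant for symmetric matrices; eigenvectors come back as columns of V
lam, V = np.linalg.eigh(Sigma)

# check Sigma v = lambda v for one eigenpair
v = V[:, 0]
print(np.allclose(Sigma.dot(v), lam[0] * v))   # True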



How many pairs of these eigenvectors and eigenvalues can we find? It just so happens that there are D such pairs. We'll say each eigenvalue λ_i has a corresponding eigenvector v_i.

Now again, for no particular reason at all, we are going to combine all the v_i into a matrix, let's call it V.

We could line up the v_i in any random order, but instead we are going to choose a very particular order – we are going to line them up such that the corresponding eigenvalues λ_i are in descending order. In other words:

\lambda_1 > \lambda_2 > \lambda_3 > \dots > \lambda_D

So:

V = \begin{pmatrix} | & | & | & \cdots & | \\ v_1 & v_2 & v_3 & \cdots & v_D \\ | & | & | & \cdots & | \end{pmatrix}

We are also going to combine the eigenvalues into a diagonal matrix like this:

\Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_D \end{pmatrix}

As a sidenote, don't worry about how we find the eigenvalues and eigenvectors. That is a huge problem in and of itself. The method you used in high school – solving a polynomial to get the eigenvalues, plugging the eigenvalues into the eigenvalue equation to get the eigenvectors, etc. – doesn't really translate to computer code. There exist some very involved algorithms to solve this problem, but we'll just use numpy.

In any case, it is easy to prove to yourself, with a simple 2-D example, that:

\Sigma_X V = V \Lambda \qquad (2)

Just one more ingredient. We want to show that V is orthonormal. In terms of the individual eigenvectors this means that v_i^T v_j = 0 if i ≠ j, and v_i^T v_i = 1. In matrix form this means V^T V = I, or in other words, V^T = V^{-1}.

Proof of orthogonality: Consider λ_j v_i^T v_j for i ≠ j. λ_j v_i^T v_j = v_i^T (λ_j v_j) = v_i^T Σ_X v_j = (Σ_X^T v_i)^T v_j. But we know that Σ_X is symmetric, so this is (Σ_X v_i)^T v_j = (λ_i v_i)^T v_j = λ_i v_i^T v_j. Since λ_i ≠ λ_j, this can only be true if v_i^T v_j = 0.

Normalizing eigenvectors is easy since they are not unique – just choose values so that their length is 1.

Finally, using (1), consider the covariance of the transformed data, Σ_Z:

\Sigma_Z = \frac{1}{N} (Z - \mu_Z)^T (Z - \mu_Z)
         = \frac{1}{N} (XQ - \mu_X Q)^T (XQ - \mu_X Q)
         = \frac{1}{N} Q^T (X - \mu_X)^T (X - \mu_X) Q
         = Q^T \Sigma_X Q

Now, suppose we choose Q = V. Then, by plugging in (2), we get:

\Sigma_Z = V^T \Sigma_X V
         = V^T V \Lambda
         = \Lambda

So what does this tell us? Since Λ is a diagonal matrix, there are no correlations in the transformed data. The variance of each dimension of Z is equal to the corresponding eigenvalue. In addition, because we sorted the eigenvalues in descending order, the first dimension of Z has the most variance, the second dimension has the second-most variance, etc. So most of the information is kept in the leading columns, as promised.
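Before the scikit-learn verification below, here is a from-scratch sketch of the Q = V construction (my own addition, on random data), checking that the covariance of Z comes out as the diagonal matrix Λ:

import numpy as np

X = np.random.random((100, 10))
Xc = X - X.mean(axis=0)
Sigma = Xc.T.dot(Xc) / len(X)

# eigendecomposition of the symmetric covariance matrix
lam, V = np.linalg.eigh(Sigma)

# eigh returns eigenvalues in ascending order; re-sort eigenvalues and columns of V descending
idx = np.argsort(lam)[::-1]
lam, V = lam[idx], V[:, idx]

# transform with Q = V and check that the covariance of Z is the diagonal matrix Lambda
Z = Xc.dot(V)
Sigma_Z = Z.T.dot(Z) / len(Z)
print(np.allclose(Sigma_Z, np.diag(lam)))   # True: uncorrelated, variances = sorted eigenvalues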

You can verify this in Python:



import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA()
X = np.random.random((100,10)) # generate an N = 100, D = 10 random data matrix
Z = pca.fit_transform(X)

# visualize the covariance of Z
plt.imshow(np.cov(Z.T))
plt.show()

(http://lazyprogrammer.me/wp-content/uploads/2015/11/covariance.png)

Observe that the off-diagonals are (approximately) 0 and that the variances along the diagonal are in decreasing order.

You can also confirm that Q^T Q = I.

QTQ = pca.components_.T.dot(pca.components_)
plt.imshow(QTQ)
plt.show()

print(np.around(QTQ, decimals=2))

(http://lazyprogrammer.me/wp-content/uploads/2015/11/qtq.png)

Remember to leave a comment below if you have any questions on the above material. If you want to go more in depth on this and other data science topics (both with the math and the code), check out some of my data science video courses online (http://lazyprogrammer.me/data-science-courses).
