
Principal Component Analysis

L. Graesser

March 14, 2016

Introduction

Principal Component Analysis (PCA) is a popular technique for reducing the size of a dataset. Let's assume that the dataset is structured as an m x n matrix where each row represents a data sample and each column represents a feature. Reducing the size of the dataset is achieved by first linearly transforming the dataset, retaining the original dimensions, and then selecting the first 1 : k (k < n) columns of the resulting matrix. PCA is useful for three reasons. First, it reduces the amount of memory a dataset takes up with minimal loss of information, and so can significantly speed up the runtime of learning algorithms. Second, it reveals the underlying structure of the data, and in particular the relationships between features. It reveals how many uncorrelated relationships there are in a dataset; this is the rank of the dataset, the maximum number of linearly independent column vectors. More formally, PCA re-expresses the features of a dataset in an orthogonal basis of the same dimension as the original features. This has the nice property that the new features are uncorrelated; that is, the new features form an orthogonal set of vectors. Finally, PCA can help to prevent a learning algorithm from overfitting the training data.

The objective of this paper is to explain what PCA is and to explore when it is and is not useful for data analysis. Section one explains the mathematics of PCA. It starts with the desirable properties of the transformed dataset and works through the mathematics which guarantee these properties. It is intended to be understood by a reader who has a basic understanding of linear algebra, but can be skipped if readers wish only to see the application of PCA to datasets. Section two uses toy datasets to demonstrate what happens to the principal components and the accuracy of a simple principal component regression when the variance of the features changes. Section three explores the trade-off between dimensionality reduction using PCA and the performance of a linear model. It compares the performance of linear regression, ridge regression, and principal component regression in predicting the median household income of US counties. The objective of this section is to show how the accuracy of principal component regression changes as the number of principal components (dimensions) is reduced, and to assess how effective PCA is in preventing overfitting when compared to ridge regression [1].

The paper ends with a summary of the results and a brief discussion of possible extensions. A separate appendix contains the detailed results of the median income analysis, and Matlab code for a number of functions that I found useful to write during this process is available on GitHub at https://github.com/lgraesser/PCA

[1] Regularized linear regression.


Section 1: The Mathematics of Principal Component Analysis

Let X be an m x n matrix. Each row represents an individual data sample, so there are m samples, and each column represents a feature, so there are n features. Let X = [x_1, ..., x_n], where x_i is a column vector ∈ R^m. Principal Component Analysis is the process of transforming the dataset, X, into a new dataset Y = [y_1, ..., y_n], y_i ∈ R^m. The objective is to find a transformation matrix, P, such that Y = XP, and which ensures that Y [2] has the following properties.

(A) Each feature, y_i, is uncorrelated with all other features. That is, the covariance [3] between features is 0. Let µ_{y_i} = (1/m) Σ_{j=1}^{m} y_i^{(j)}, the mean of vector y_i; then the covariance between two features, y_i and y_j, is as follows.

        Cov_{y_i, y_j} = (y_i − µ_{y_i})^T (y_j − µ_{y_j}) / (m − 1) = y_i^T y_j / (m − 1) = 0   ∀ i ≠ j

    Note: the rightmost equality holds if and only if y_i and y_j are mean-normalized, which by assumption they are (see footnote [2]).

(B) The features are ordered from highest to lowest variance, left to right. y_1 should have the highest variance, y_n the lowest. This will turn out to be essential when reducing the dimensionality of the dataset Y by choosing the first k < n columns of Y.

(C) The total variance of the original dataset is unchanged; that is, X and Y have the same total variance.

The columns of Y are then called the principal components of X.

Mathematically, Principal Component Analysis is very elegant. At its heart is the property that all symmetric matrices [4] are orthogonally diagonalizable [5]. This means that for any symmetric n x n matrix, A, there exists an orthogonal [6] matrix E = [e_1, ..., e_n] and a diagonal matrix, D, such that

    A = E D E^{-1} = E D E^T, since E is orthogonal and square and so E^T E = I

[2] Throughout this derivation, the resulting matrix, Y, is assumed to be in mean deviation form. This means that the mean of each y_i ∈ Y has been subtracted from each element of y_i. This guarantees that the mean of each y_i ∈ Y is 0.
[3] Covariance is a measure of how much changes in one variable are associated with changes in another variable.
[4] A symmetric matrix is a matrix for which A = A^T.
[5] The Spectral Theorem for Symmetric Matrices, Lay.
[6] An orthogonal matrix is a matrix in which the norm (length) of each of its columns is 1 and every column vector is orthogonal to every other column vector. Let E = [e_1, ..., e_n] be an orthogonal matrix; then e_i^T e_j = 0 ∀ i ≠ j and e_i^T e_j = 1 ∀ i = j.
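As a concrete illustration of the orthogonal diagonalization A = E D E^T above (my own Matlab sketch, not part of the original paper; variable names are arbitrary), the decomposition can be checked numerically for a small symmetric matrix.

    % Sketch: numerically checking that a symmetric matrix is
    % orthogonally diagonalizable, A = E*D*E'.
    B = randn(4);               % arbitrary 4 x 4 matrix
    A = B + B';                 % A is symmetric by construction
    [E, D] = eig(A);            % columns of E are eigenvectors, D is diagonal
    disp(norm(E'*E - eye(4)))   % approximately 0: E is orthogonal
    disp(norm(A - E*D*E'))      % approximately 0: A = E D E^T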

The transformation matrix, P, will be derived in two ways. First, it is derived using the fact that all symmetric matrices are orthogonally diagonalizable. Second, it is derived using the Singular Value Decomposition (SVD). Looking at both derivations is interesting because it highlights the relationship between the eigenvalues of X^T X, the variance of the feature vectors, and the Singular Value Decomposition, one of the most widely used matrix decompositions. In practice, it is often easiest to find P and the resulting Y using the SVD.

Derivation 1: Let Y = XP, and suppose that Y is mean-normalized. To satisfy property (A), the covariances between the columns of Y need to be zero, which is equivalent to requiring that Y^T Y be a diagonal matrix.

X^T X is a symmetric matrix (since (X^T X)^T = X^T (X^T)^T = X^T X), so there exists an orthogonal matrix, E, and a diagonal matrix, D, such that X^T X = E D E^T. So, we have

    Y^T Y = (XP)^T (XP)
          = P^T X^T X P
          = P^T E D E^T P

    Let P = E; then

    Y^T Y = E^T E D E^T E
          = D                                                          (1)

From this, it is clear that property (A) is satisfied. Let (d_1, ..., d_n) be the diagonals of D. Then, we have that,

    Y^T Y = [ y_1^T ]
            [  ...  ] [ y_1  ...  y_n ]
            [ y_n^T ]

          = [ y_1^T y_1   y_1^T y_2   ...   y_1^T y_n ]
            [    ...                            ...   ]
            [ y_n^T y_1   y_n^T y_2   ...   y_n^T y_n ]

          = [ d_1   0    ...   0  ]
            [  0   d_2   ...   0  ]
            [           ...       ]
            [  0    0    ...  d_n ]                                    (2)

The covariance of features y_i and y_j, ∀ i ≠ j, is zero, which is what we were looking for. The fact that P is an orthogonal matrix also guarantees that property (C) is satisfied, since an orthogonal change of variables does not change the total variance of the data [7]. At this point we know that a matrix, P, exists such that Y = XP will satisfy properties (A) and (C). But we do not yet know if property (B) is satisfied. The second derivation, using the Singular Value Decomposition, will make this clear.

In the next section it is useful to keep in mind the properties of the matrices E and D from this derivation. The diagonal of D contains the eigenvalues [8] of X^T X, and the columns of E are the corresponding eigenvectors [9], which form an orthogonal set since any two eigenvectors corresponding to distinct eigenvalues (distinct eigenspaces) of a symmetric matrix are orthogonal [10].

Derivation 2: The following theorem [11] will be crucial to guaranteeing that property (B) is satisfied, as it is closely related to the variance of the new feature vectors in Y.

Let A be any matrix, let x be any unit vector of appropriate dimension, and let λ_1 be the greatest eigenvalue of A^T A. Then the maximum length that Ax can have is

    max ‖Ax‖ = max ((Ax)^T (Ax))^{1/2} = max (x^T A^T A x)^{1/2} = λ_1^{1/2}

The Singular Value Decomposition is an extremely useful matrix decomposition since it states that any m x n matrix A can be decomposed as follows,

    A = U Σ V^T

    U is an orthogonal m x m matrix
    V is an orthogonal n x n matrix

    Σ = [ D  0 ]     (Note: if Σ is square, the bottom-right 0-block is excluded)       (3)
        [ 0  0 ]

where Σ is m x n, D is a diagonal r x r matrix, r = column rank(A) = rank(A), and the diagonal entries of D are the first r singular values of A ordered from largest to smallest, left to right; that is, σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0, where σ_i is a nonzero diagonal entry of D.

[7] Linear Algebra and its Applications, Lay, pg 481.
[8] An eigenvalue is a scalar, λ, such that Ax = λx, where x is nonzero.
[9] An eigenvector x is a nonzero vector such that Ax = λx, where A is an n x n matrix and λ is a scalar. Eigenvectors are said to correspond to eigenvalues.
[10] The Spectral Theorem for Symmetric Matrices.
[11] Linear Algebra and its Applications, Lay, pg 460.


Let X = U Σ V^T. Substituting this into Y^T Y = P^T X^T X P, we have,

    Y^T Y = P^T X^T X P
          = P^T (U Σ V^T)^T (U Σ V^T) P
          = P^T V Σ^T U^T U Σ V^T P
          = P^T V Σ^2 V^T P

    Let P = V; then

    Y^T Y = V^T V Σ^2 V^T V
          = Σ^2                                                        (4)

From this, we know that,

    Squared singular values of X = Σ^2 = D = eigenvalues of X^T X
        (after ordering the eigenvalues largest to smallest, left to right)

    Singular vectors of X = V = E = orthogonal set of eigenvectors of X^T X
        (ordered to correspond to the ordered eigenvalues of D)        (5)

Now we can prove that this transformation, P, does in fact guarantee that property (B) is satisfied. We know that,

    X = U Σ V^T
    Y = XP = XV = U Σ V^T V = U Σ = [σ_1 u_1, ..., σ_r u_r, 0, ..., 0]
    y_i = σ_i u_i                                                      (6)
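A quick numerical check of equation (6) in Matlab (my own sketch, not from the paper): for a small mean-normalized matrix, the principal components XV coincide with the scaled left singular vectors UΣ.

    % Sketch: numerical check of equation (6), Y = X*V = U*Sigma.
    X = randn(6, 3);
    X = X - mean(X, 1);       % mean-normalize each column
                              % (on MATLAB < R2016b use bsxfun(@minus, X, mean(X, 1)))
    [U, S, V] = svd(X, 'econ');
    disp(norm(X*V - U*S))     % approximately 0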

Now, consider the variance of the first feature of Y, y_1,

    Var_{y_1} = Σ_{i=1}^{m} (y_1^{(i)} − µ_{y_1})^2
              = Σ_{i=1}^{m} (y_1^{(i)})^2              since Y is mean-normalized
              = y_1^T y_1
              = (σ_1 u_1)^T (σ_1 u_1)
              = u_1^T σ_1^2 u_1
              = σ_1^2 u_1^T u_1                        since σ_1 is a scalar
              = σ_1^2                                  since U is orthogonal, so u_1 is a unit vector
              = λ_1, the largest eigenvalue of X^T X                   (7)


Recall that the maximum value that ‖Xa‖ can take for any appropriately dimensioned unit vector a is λ_1^{1/2}, so the maximum value of ‖Xa‖^2 is λ_1, the largest eigenvalue of X^T X. It is therefore not possible for the variance of any of the y-vectors to be larger than that of y_1. And, since the transformation matrix, V, preserves the total variance of X, we know that y_1 is the vector representing the direction of maximal variance of the features (columns) of matrix X. From an extension of the same theorem [12], it can be established that y_2 is the vector representing the direction of the second highest variance of the matrix X, subject to the constraint that y_1 and y_2 are orthogonal. More generally, y_i is the vector representing the direction of maximum variance of X given that y_i and y_j are orthogonal ∀ j < i. So, the feature vectors (columns) of Y are ordered from the direction of highest variance in X to the lowest, from left to right.

It is now clear that our transformed dataset Y = XV, where V is the matrix of right singular vectors of the Singular Value Decomposition of X, has all of the properties that we were looking for.

Having obtained our matrix of principal components, Y, we can now reduce the dimensionality of Y simply by selecting the leftmost k < n columns, knowing that these features contain the maximum possible information (variance) about the original dataset X for any k < n. The proportion of the variance of the original data captured in these k columns is the sum of the first k eigenvalues of X^T X (the squared singular values of X) divided by the sum of all of them. How many principal components to choose depends on the application and the dataset. We will see an example of this in practice in Section 3.
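A minimal Matlab sketch of this procedure (an assumed helper of my own, not the code from the accompanying GitHub repository) computes the first k principal components via the SVD and reports the proportion of variance they retain.

    % Sketch: PCA of an m x n data matrix X via the SVD, returning the first
    % k principal components and the proportion of total variance retained.
    function [Yk, varExplained] = pcaFirstK(X, k)
        Xc = X - mean(X, 1);          % mean-normalize each feature (column);
                                      % on MATLAB < R2016b use bsxfun(@minus, X, mean(X, 1))
        [~, S, V] = svd(Xc, 'econ');  % Xc = U*S*V', singular values sorted largest first
        Yk = Xc * V(:, 1:k);          % keep the leftmost k principal components
        lambda = diag(S).^2;          % squared singular values = eigenvalues of Xc'*Xc
        varExplained = sum(lambda(1:k)) / sum(lambda);
    end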

Section 2: PCA applied: changing variable variance in principal component regression

The following simple examples demonstrate how the principal components of a dataset and the performance of a principal component regression [13] change as the variance of the features changes. To show this I created two sets of datasets, each containing five datasets. Each dataset contained two features, x_1 and x_2 ∈ R^1000, and one target variable, y ∈ R^1000. All features were mean-normalized. In every dataset, x_1 is a strong predictor of y, and a linear model, y = β_0 + β_1 x_1, has an R^2 of ≈ 97-98% for each dataset [14]. In the first set of five datasets, x_2 is also correlated with y, but the feature contains more noise [15], and so is a weaker predictor of y. The relationship between x_1, x_2, and y is displayed in the charts below.

[12] Linear Algebra and its Applications, Lay, pg 463.
[13] Principal component regression is simply linear regression using the first k ≤ n columns of the transformed dataset Y = XV to predict a target variable.
[14] R^2 is the percentage of variance of the target variable explained by the model. It is calculated as follows: total sum of squares (TSS) = Σ_i (y_i − ȳ)^2, explained sum of squares (ESS) = Σ_i (f_i − ȳ)^2, residual sum of squares (RSS) = Σ_i (y_i − f_i)^2, and R^2 = 1 − RSS/TSS.
[15] Modeled by increasing the variance of x_2 in a way that is uncorrelated with y.


[Figure: scatter plots for the five datasets in the first set. For each dataset, three panels show x1 vs. x2, x1 vs. y, and x2 vs. y.]

As the variance of x_2 increases, the accuracy [16] of a principal component regression using just the first principal component falls, despite the fact that y is best described as a function of one variable. Whilst principal component regression does not perform as poorly as the linear model y = β_0 + β_1 x_2 (see table 2 below), it is significantly worse than the linear model y = β_0 + β_1 x_1. This is because the maximal direction of variance, the first principal component, in each dataset is a linear combination of both x_1 and x_2, with the relative weight of x_2 increasing as the variance of x_2 increases (see table 1 below). So, the presence of the high-variance but weaker predictor, x_2, makes the model worse.

[16] Measured by R^2.


Table 1: Principal Component 1: x1 and x2 coefficients

    Model                     x1 coefficient    x2 coefficient
    Var X1 = 51, X2 = 63      0.67              0.74
    Var X1 = 50, X2 = 73      0.62              0.78
    Var X1 = 45, X2 = 86      0.54              0.84
    Var X1 = 49, X2 = 102     0.51              0.86
    Var X1 = 52, X2 = 109     0.51              0.86

Table 2: Results (R^2): X2, X1 correlated with y, X2 variance increased

    Model                     PCR: 2 comps    PCR: 1 comp    y = β0 + β1x1    y = β0 + β1x2
    Var X1 = 51, X2 = 63      0.976           0.928          0.975            0.819
    Var X1 = 50, X2 = 73      0.974           0.842          0.974            0.643
    Var X1 = 45, X2 = 86      0.972           0.746          0.971            0.539
    Var X1 = 49, X2 = 102     0.975           0.683          0.975            0.469
    Var X1 = 52, X2 = 109     0.977           0.664          0.977            0.443
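A Matlab sketch of the kind of experiment summarized in Tables 1 and 2 (with my own assumed parameter values, not the author's exact datasets) generates a noisy second feature and fits a one-component principal component regression.

    % Sketch: x1 predicts y well; x2 is correlated with y but noisier. Fit a
    % principal component regression on the first component only and report R^2.
    m  = 1000;
    x1 = 7 * randn(m, 1);               % Var(x1) around 49
    y  = 0.9 * x1 + randn(m, 1);        % y is essentially a function of x1
    x2 = 0.8 * x1 + 8 * randn(m, 1);    % correlated with y, but with extra noise
    X  = [x1 x2];
    X  = X - mean(X, 1);                % mean-normalize the features
    yc = y - mean(y);

    [~, ~, V] = svd(X, 'econ');
    z1   = X * V(:, 1);                 % first principal component
    beta = z1 \ yc;                     % least-squares coefficient
    R2   = 1 - sum((yc - z1*beta).^2) / sum(yc.^2)   % R^2 of the one-component PCR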

In the second set of five datasets, x_2 is not correlated with y, and the variance of x_2 is gradually increased. The relationship between x_1, x_2, and y is displayed in the charts below.


[Figure: scatter plots for the five datasets in the second set. For each dataset, three panels show x1 vs. x2, x1 vs. y, and x2 vs. y.]

This second example is much starker. The accuracy of the principal component regression with one component is barely affected until the variance of x_2 is close to that of x_1, and when the variance of x_2 exceeds that of x_1, the model ceases to be useful: the principal component regression stops being able to explain any of the variance of the target variable, y (see table 4). This is to be expected since the first principal component is almost entirely composed of the feature with the larger variance (see table 3), which is effective only so long as this feature is the predictive feature, x_1.

Table 3: Principal Component 1: x1 and x2 coefficients

    Model                     x1 coefficient    x2 coefficient
    Var X1 = 51, X2 = 10      1.00              0.03
    Var X1 = 50, X2 = 26      1.00              -0.02
    Var X1 = 52, X2 = 37      0.99              -0.11
    Var X1 = 51, X2 = 49      0.93              0.36
    Var X1 = 51, X2 = 61      0.12              0.99


Table 4: Results (R^2): X1 correlated with y, X2 uncorrelated

    Model                     PCR: 2 comps    PCR: 1 comp
    Var X1 = 51, X2 = 10      0.976           0.975
    Var X1 = 51, X2 = 26      0.975           0.975
    Var X1 = 52, X2 = 37      0.977           0.968
    Var X1 = 51, X2 = 49      0.975           0.857
    Var X1 = 51, X2 = 61      0.973           0.01

Both examples highlight the importance of ensuring that features all have the same variance before carrying out PCA, since the composition of the principal components is highly sensitive to the variance of the features. However, this may not completely solve the problem encountered in the first set of datasets. If features are correlated, then even if the variance of each feature is standardized to one, a linear combination of multiple features may have a variance > 1. For example, if the best predictor of the target variable is a single feature, and that feature has a lower variance than a linear combination of correlated features, then the first principal component will not be as predictive as that feature. In other words, the direction of highest variance in the data is not the most meaningful, which is contrary to a fundamental assumption of PCA.
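The standardization step itself is simple; a Matlab sketch (continuing the assumed toy example above, not the author's code) is:

    % Sketch: standardize each feature to unit variance before PCA so that no
    % feature dominates the principal components purely through its scale.
    Xstd = X ./ std(X, 0, 1);       % each column of X already has mean 0; now variance 1
                                    % (on MATLAB < R2016b use bsxfun(@rdivide, X, std(X, 0, 1)))
    [~, ~, Vstd] = svd(Xstd, 'econ');
    Vstd(:, 1)                      % loadings of the first principal component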

By standardizing the variance of x_1 and x_2 and re-running the analysis from the first set of five datasets (see table 5), this issue becomes clear. Whilst the accuracy of the principal component regression with one principal component has improved compared to when feature variance was not standardized, it is still significantly worse than the linear model y = β_0 + β_1 x_1.

Table 5: Results (R^2): X2, X1 correlated with y, X2 "noisier", standardized variance

    Model                                   PCR: 2 PC    PCR: 1 PC    y = β0 + β1x1    y = β0 + β1x2
    Var x1=47, x2=57, std var x1=x2=1       0.973        0.929        0.973            0.805
    Var x1=56, x2=81, std var x1=x2=1       0.977        0.9          0.977            0.692
    Var x1=55, x2=100, std var x1=x2=1      0.975        0.855        0.975            0.561
    Var x1=49, x2=100, std var x1=x2=1      0.975        0.831        0.975            0.482
    Var x1=47, x2=111, std var x1=x2=1      0.974        0.812        0.974            0.431

Section 3: Predicting Median Household Income in US Counties

This section examines the effect of changing the number of principal components on the performance of principal component regression. Then, the performance of principal component regression is compared with linear regression and ridge regression, with the objective of evaluating whether PCA is a good approach for preventing overfitting.


The dataset [17] used throughout this section is a set of 15 features for US counties, and this information is used to predict the median household income by county. The data is structured into a 3,143 x 15 matrix, X, in which each row represents the information for a single county and each column represents a single feature. The 15 features are as follows:

(1) Population 2014, absolute

(2) Population 2010 measured in April, absolute

(3) Population 2010 measured at end of year, absolute

(4) Number of Veterans, absolute

(5) Number of Housing Units, absolute

(6) Private non farm employment, absolute

(7) Total number of firms, absolute

(8) Manufacturing shipments, $k

(9) Retail sales, value, $k

(10) Accommodation and food services sales, $k

(11) Foreign born persons, %

(12) High school graduates or higher, %

(13) Bachelor’s degree or higher, %

(14) Persons below the poverty line, %

(15) Women owned firms, %

The two figures below plot each feature vs. median household income. The first figure plots the raw data. In the second figure the data has been mean-normalized and the features scaled by dividing each element of each feature vector by the corresponding standard deviation for that feature.

Examining the plots, a first observation is that the range of values of the features varies significantly. From the mathematical discussion in section two, it is clear that PCA is highly sensitive to the absolute range of features, since it affects the variance of those features. If the data were analyzed as is, the principal components would be dominated by the variance in the features with the largest absolute values. This is not necessarily of interest, since it might be the case that the variability of features with smaller absolute values, for example the % of the population with a BA degree or higher, are better predictors of median income than, say, the size of a county. So, in order to be able to capture informative differences in the variation of a feature, all of the analysis is performed on the scaled dataset.

[17] Source: census.gov. See references for link to data.

It also appears that features 1 - 10 are highly correlated and are essentially different measures of the size of a county. This is confirmed by examining the covariance matrix [18] of the features (columns of X). Feature 11, the percent of foreign born persons, is also somewhat correlated with features 1 - 10 and not with any others. Feature 12, the percent of high school graduates or higher, is not correlated with features 1 - 11, but is reasonably strongly correlated with the percent holding a bachelor's degree or higher (feature 13) and (inversely) with the percent of persons below the poverty line (feature 14). Feature 13, the percent holding a bachelor's degree or higher, is most strongly correlated with feature 12, and somewhat (inversely) with feature 14. Feature 15, the percent of women owned firms, is not strongly correlated with any other feature.

[18] See Appendix.


[Figure: median household income 2007-14 plotted against each of the 15 raw features (Pop 14, Pop 10 (April), Pop 10 (year end), Num veterans, Housing Units, Private non farm employment, Num firms, Manufacturing shipments, Retail sales, Accom and food sales, % foreign born, High school or higher, BA degree or higher, % below poverty line, % women owned firms).]


[Figure: median household income 2007-14 plotted against each of the 15 features after mean normalization and scaling.]

Given the high degree of correlation between features, it should be possible to reduce the number of principal components used without having a significant impact on the performance of a principal component regression model.

The coefficient matrix (see Appendix) helps us to learn more about the structure of the data. Component 1 most heavily weights features 1 - 10 in the original dataset. This supports the hypothesis that features 1 - 10 are related variations of one measure, the size of a county. Component 2 is mostly composed of features 12 - 14, indicating that a linear combination of these three features is the next highest direction of variance in the data. Components 3 and 4 are mostly a combination of features 11 and 15, whilst component 5 is mostly a combination of features 13 and 14. So the largest direction of variance in the data is county size, and the next four highest directions of variance are mostly combinations of features 11 - 15. However, it may still have been useful to include all of features 1 - 10, since the first 5 principal components explain the majority, 74.83%, of the variance in the data, but not all of it (see tables below).

Percentage of variance explained by the first k principal components (I / II)

    Num. Principal Components     1       2       3       4       5       6       7       8
    Percent Variance Explained    31.41   47.38   58.11   67.63   74.83   81.12   86.05   90.31

Percentage of variance explained by the first k principal components (II / II)

    Num. Principal Components     9       10      11      12      13      14      15
    Percent Variance Explained    94.18   96.50   98.01   99.02   99.77   100.00  100.00

Next, I compared the performance of principal component regression [19] with linear regression and ridge regression. What follows is a brief summary of the methodology and results; see the Appendix for further detail. The dataset was split into a training set containing 80% of the samples (counties), randomly selected; the remaining 20% was kept as a holdout set for testing how well the models generalized to unseen data. For both principal component regression and ridge regression, 10-fold cross-validation [20] was used to select the number of principal components and the value of the regularization parameter, k [21].

As the number of principal components is reduced from 12 to 8, the performance of a principal component regression on the training data stays stable, with an R^2 of 77% - 78% (see graph below), after which it starts to fall, and it begins to drop precipitously with fewer than 5 components. This suggests that there are a minimum of 5 dimensions that are critical to predicting median county income. The performance of the model on unseen test data varies significantly between 5 and 12 components, suggesting some overfitting of the data, but follows the same pattern as the training data with fewer than 5 components. Five principal components therefore appears to be the optimal number to use.

[19] Principal component regression is simply linear regression using k ≤ n columns of the transformed dataset X_PCA = XV to predict the target variable, y.
[20] The training data was further split into 10 different sets of training data and validation data. The training data in each set contained 90% of the total training set samples, randomly selected, and the remaining 10% was held out as a validation set to test the model on unseen data.
[21] Whilst the value of k that achieved the highest performance on the validation set was k = 0.01, values of k from 0.03 to 1 performed significantly worse. I therefore chose a more conservative k = 3, which also achieved a good result on the validation set.
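A Matlab sketch of the cross-validation loop for choosing the number of principal components (an assumed helper of my own, not the author's code; Xtrain and ytrain stand for the scaled training features and target):

    % Sketch: 10-fold cross-validation over the number of principal components
    % for a principal component regression.
    function bestK = choosePcrComponents(Xtrain, ytrain, maxK)
        m      = size(Xtrain, 1);
        perm   = randperm(m);
        nFolds = 10;
        edges  = round(linspace(0, m, nFolds + 1));
        mse    = zeros(maxK, nFolds);
        for f = 1:nFolds
            valIdx = perm(edges(f)+1 : edges(f+1));     % validation indices for this fold
            trIdx  = setdiff(perm, valIdx);
            Xtr = Xtrain(trIdx, :);   ytr = ytrain(trIdx);
            Xva = Xtrain(valIdx, :);  yva = ytrain(valIdx);
            mu  = mean(Xtr, 1);
            [~, ~, V] = svd(Xtr - mu, 'econ');          % PCA fitted on the training fold only
                                                        % (implicit expansion; use bsxfun on MATLAB < R2016b)
            for k = 1:maxK
                Ztr  = (Xtr - mu) * V(:, 1:k);
                b    = [ones(size(Ztr, 1), 1) Ztr] \ ytr;                 % intercept + k components
                pred = [ones(size(Xva, 1), 1) (Xva - mu) * V(:, 1:k)] * b;
                mse(k, f) = mean((yva - pred).^2);
            end
        end
        [~, bestK] = min(mean(mse, 2));                 % lowest average validation error
    end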


[Figure: principal component regression R^2 on the training set and the test set as the number of principal components varies from 3 to 12.]

Once the number of principal components and the size of k had been fixed (at PC = 5 and k = 3), each model was trained on the training data and tested on the original test dataset. Finally, six further training and holdout datasets were generated, with the data split randomly between the two in the same 80:20 training:test ratio. Taking inspiration from Janecek, Gansterer, Demel and Ecker's paper, "On the Relationship between Feature Selection and Classification Accuracy", JMLR: Workshop and Conference Proceedings 4:90-105, I also created two combined models to test on these additional datasets. For the first combined model, the first 5 principal components were added to the original matrix, after which a normal linear regression process was followed. The second model used the same combined dataset but followed a ridge regression process. For each set, each model was trained on the training data and tested on the holdout dataset. The performance of all the models (R^2) on the unseen holdout sets is summarized below:

Results: Model performance on holdout test sets (bold font = best model per test)

    Model       Result 1   Result 2   Result 3   Result 4   Result 5   Result 6   Result 7   Avg
    PCR         0.7346     0.7494     0.7419     0.7548     0.7133     0.7555     0.7159     0.7379
    LR          0.7669     0.7248     0.7606     0.7748     0.7051     0.7756     0.7436     0.7502
    RR          0.7685     0.7458     0.7624     0.7758     0.7269     0.7751     0.7458     0.7572
    LR + PCR    -          0.7248     0.7606     0.7748     0.7051     0.7756     0.7436     0.7502
    RR + PCR    -          0.7392     0.7598     0.7752     0.7196     0.7775     0.7467     0.7530
    Average     0.7567     0.7368     0.7571     0.7711     0.7140     0.7719     0.7391     0.7491


Overall, ridge regression generalized best to unseen data, whilst principal component regression performed worst. The strong performance of the simple linear regression provides some insight as to why: since the linear model generalized well to unseen data, this suggests that although there was high correlation between the first 10 features, they still contained information that was useful for predicting median income. That is, the informative number of dimensions in the dataset was closer to 15 than to 5, the number of dimensions chosen for the principal component regression.

The combined models were not a success. Linear regression plus PCA generated exactly the same results as linear regression. This is to be expected given that the principal components are linear combinations of the columns of the original dataset, and so adding principal components simply duplicates information. Ridge regression plus PCA was more interesting, since on two of six occasions this model achieved the best result. However, in these cases the performance of plain ridge regression was close, and overall ridge regression outperformed the combined model.
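A Matlab sketch of the combined design matrix (my own illustration; Xs stands for the scaled 3,143 x 15 feature matrix, V for its right singular vectors, and y for the median income vector, all assumed from the earlier steps):

    % Sketch: the appended components are exact linear combinations of the
    % columns of Xs, so the column space, and hence the ordinary least-squares
    % fit, is unchanged -- which is why LR and LR + PCA give identical results.
    Z5   = Xs * V(:, 1:5);                      % first 5 principal components
    Xcmb = [Xs Z5];                             % combined design matrix (rank unchanged)
    bCmb = [ones(size(Xcmb, 1), 1) Xcmb] \ y;   % least-squares fit; the system is rank
                                                % deficient, so MATLAB warns and returns
                                                % a basic solution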

Examining the difference in performance between the training and the holdout sets (displayed in the tables below) shows that principal component regression does generalize relatively better than linear regression, but not as well as ridge regression. Linear regression performed best on the training data but also overfit the most, with a 3.3 percentage point drop in performance between the training and test data. In contrast, principal component regression only had a 2.6 percentage point drop, but ridge regression overfit the least, with only a 1.6 percentage point drop.

Results: Model performance on training sets (bold font = best model per test)

    Model       Result 1   Result 2   Result 3   Result 4   Result 5   Result 6   Result 7   Avg
    PCR         0.7577     0.7458     0.7785     0.7525     0.7981     0.7538     0.7623     0.7641
    LR          0.7828     0.7796     0.7848     0.7808     0.7901     0.7804     0.7876     0.7837
    RR          0.7950     0.7802     0.7189     0.7816     0.7963     0.7732     0.7651     0.7729
    LR + PCR    -          0.7796     0.7848     0.7808     0.7901     0.7804     0.7876     0.7839*
    RR + PCR    -          0.7790     0.7845     0.7805     0.7897     0.7800     0.7872     0.7776
    Average     0.7785     0.7728     0.7703     0.7752     0.7929     0.7736     0.7780     0.7776

* Note: Linear regression and linear regression + PCA lead to the same results; there is one less result for LR + PCA.

Comparing average performance: training set vs. test set


    Model       Training Avg.   Test Avg.   Delta (test - training)
    PCR         0.7641          0.7379      -0.0262
    LR          0.7837          0.7502      -0.0335
    RR          0.7729          0.7572      -0.0157
    LR + PCR    0.7839          0.7474      -0.0365
    RR + PCR    0.7835          0.7530      -0.0305
    Average     0.7776          0.7491      -0.0285

Based on this analysis, it appears that ridge regression is the best model to use if overfitting is a concern. It has the benefit of not discarding any information which could potentially be useful, while penalizing large coefficients, which leads to better performance on unseen data. It is, however, interesting to see that principal component regression achieves performance within 2 percentage points of ridge regression despite using a dataset one third of the size.

Section 4: Conclusion

Principal Component Analysis reveals useful information about the nature and structure of a set of features and so is a valuable process to follow during initial data exploration. Additionally, it provides an elegant way to reduce the size of a dataset with minimal loss of information, thus minimizing the fall in performance of a linear model as the size of the dataset is reduced. However, the performance of principal component regression suggests that if data compression is not necessary, a better approach to preventing overfitting of linear models is to use ridge regression.

It would be interesting to extend this analysis by comparing the performance of principal component regression and ridge regression on a dataset for which a simple linear model significantly overfits the training data, i.e. does not generalize well to unseen data. Additionally, it would be informative to explore the effect of applying principal component analysis to reduce the size of a dataset used with more complex models such as neural networks or k-nearest-neighbors classification. I would be interested to understand under what circumstances, if any, reducing the size of a dataset using PCA improves the performance of a predictive model on unseen data when compared with the training data.

Acknowledgements

Thank you to Professor Margaret Wright for giving me the idea for a project on Principal Component Regression in the first place, and for pointing me towards the excellent tutorial on the subject by Jonathon Shlens; to Clinical Assistant Professor Sophie Marques for discussing the mathematics of orthogonal transformations and for sparking the idea to systematically change the variance of x_2; and to Professor David Rosenberg for suggesting I compare principal component regression to ridge regression and that I read Elements of Statistical Learning.


References

(1) Linear Algebra and its Applications, David Lay, Chapter 7

(2) Elements of Statistical Learning, Hastie, Tibshirani, Friedman, Chapter 3

(3) A Tutorial on Principal Component Analysis, Jonathon Shlens

(4) On the Relationship between Feature Selection and Classification Accuracy, JMLR, Janecek, Gansterer, Demel, Ecker

(5) Coursera, Machine Learning, Andrew Ng, week 7 (K means clustering and PCA)

(6) PCA or Polluting your Clever Analysis: this gave me the idea for examining the performance of principal component regression as the variance of the features and their relationship to the predictor variable changed. http://www.r-bloggers.com/pca-or-polluting-your-clever-analysis/

(7) Mathworks: tutorials on principal component analysis and ridge regression

(8) Modern Regression: Ridge regression, Ryan Tibshirani, March 2013

(9) Lecture 17 Principal Component Analysis, Shippensburg University of Pennsylvania

(10) PCA, SVD, and LDA, Shanshan Ding

(11) http://yatani.jp/teaching/doku.php?id=hcistats:PCA

(12) https://en.wikipedia.org/wiki/Principal_component_analysis

(13) http://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca

(14) US County Data: census.gov, http://quickfacts.census.gov/qfd/downloaddata.html