

Lecture Notes on

Principal Component Analysis

Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU

14 December 2016

Contents

1 Intuition
  1.1 Problem statement
  1.2 Projection and reconstruction error
  1.3 Reconstruction error and variance
  1.4 Covariance matrix
  1.5 Covariance matrix and higher order structure
  1.6 PCA by diagonalizing the covariance matrix

2 Formalism
  2.1 Definition of the PCA-optimization problem
  2.2 Matrix $\mathbf{V}^T$: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
  2.3 Matrix $\mathbf{V}$: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
  2.4 Matrix ($\mathbf{V}^T\mathbf{V}$): Identity mapping within new coordinate system
  2.5 Matrix ($\mathbf{V}\mathbf{V}^T$): Projection from high- to low-dimensional (sub)space within old coordinate system
  2.6 Variance
  2.7 Reconstruction error
  2.8 Covariance matrix
  2.9 Eigenvalue equation of the covariance matrix
  2.10 Total variance of the data x
  2.11 Diagonalizing the covariance matrix
  2.12 Variance of y for a diagonalized covariance matrix

© 2004–2006, 2009, 2010, 2013, 2016 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their own copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free copyrights (here usually figures I have the rights to publish but you don't, like my own published figures). Figures I do not have the rights to publish are grayed out, but the word 'Figure', 'Image', or the like in the reference is often linked to a pdf. More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.


  2.13 Constraints of matrix $\mathbf{V}'$
  2.14 Finding the optimal subspace
  2.15 Interpretation of the result
  2.16 PCA Algorithm
  2.17 Intuition of the Results
  2.18 Whitening or sphering
  2.19 Singular value decomposition +

3 Application
  3.1 Face processing

4 Other resources
  4.1 Written material
  4.2 Visualizations
  4.3 Videos
  4.4 Software
  4.5 Exercises

5 Acknowledgment

These lecture notes are partly based on (Hertz et al., 1991).

1 Intuition

1.1 Problem statement

Experimental data to be analyzed is often represented as a number of vectors of fixed dimensionality. A single vector could for example be a set of temperature measurements across Germany. Taking such a vector of measurements at different times results in a number of vectors that altogether constitute the data. Each vector can also be interpreted as a point in a high-dimensional space. Then the data are simply a cloud of points in this space (if one ignores the temporal order, otherwise it would be a trajectory).

When analyzing such data one often encounters the problem that the dimensionality of the data points is too high to be visualized or analyzed with some particular technique. Thus the problem arises to reduce the dimensionality of the data in some optimal way.

To keep things simple we insist that the dimensionality reduction is done linearly, i.e. we are looking for a low-dimensional linear subspace of the data space, onto which the data can be projected. As a criterion for what the optimal subspace might be it seems reasonable to require that it should be possible to reconstruct the original data points from the reduced ones as well as possible. Thus if one were to project the data back from the low-dimensional space into the original high-dimensional space, the reconstructed data points should lie as close as possible to the original ones, with the mean squared distance between original and reconstructed data points being the reconstruction error. The question is: how can we find the linear subspace that minimizes this reconstruction error?

It is useful and common practice to remove the mean value from the data before doing the dimensionality reduction as stated above. Thus, we assume zero-mean data throughout. As a result, variances and 2nd moments are the same. This justifies the slightly confusing common practice of speaking of variances but writing the equations for 2nd moments. Please keep that in mind.


1.2 Projection and reconstruction error

The task of principal component analysis (PCA) is to reduce the dimensionality of some high-dimensional data points by linearly projecting them onto a lower-dimensional space in such a way that the reconstruction error made by this projection is minimal. In order to develop an intuition for PCA we first take a closer look at what it means to project the data points and to reconstruct them. Figure 1 illustrates the process.

Figure 1: Projection of 2D data points onto a 1D subspace and their reconstruction: (a) data points in 2D, (b) projection onto 1D, (c) data points in 1D, (d) reconstruction in 2D. © CC BY-SA 4.0

(a) A few data points are given in a two-dimensional space and are represented by two-dimensional vectors $\mathbf{x} = (x_1, x_2)$. (b) In order to reduce the dimensionality down to one, we have to choose a one-dimensional subspace defined by a unit vector $\mathbf{v}$ and project the data points onto it, which can be done by

  $\mathbf{x}_\parallel := \mathbf{v}\mathbf{v}^T\mathbf{x} \,.$   (1)

(c) The points can now be represented by just one number,

  $y := \mathbf{v}^T\mathbf{x} \,,$   (2)

and we do not care that they originally came from a two-dimensional space. (d) If we want to reconstruct the original two-dimensional positions of the data points as well as possible, we have to embed the one-dimensional space in the original two-dimensional space in exactly the orientation used during the projection,

  $\mathbf{x}_\parallel \overset{(1,2)}{=} \mathbf{v}\, y \,.$   (3)

However, we cannot recover the accurate 2D-position; the points remain on the one-dimensional subspace. The reconstruction error is therefore the average distance of the original 2D-positions from the one-dimensional subspace (the length of the projection arrows in (b)). For mathematical convenience one actually takes the average squared distance

  $E := \langle \|\mathbf{x}^\mu - \mathbf{x}^\mu_\parallel\|^2 \rangle_\mu$   (4)
  $\quad\; = \frac{1}{M} \sum_{\mu=1}^{M} \sum_{i=1}^{I} \left(x_i^\mu - x_{\parallel i}^\mu\right)^2 \,,$   (5)

where $\mu$ indicates the different data points, $M$ the number of data points, and $I$ the dimensionality of the data vectors.
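To make the projection and the reconstruction error concrete, here is a minimal numpy sketch; the toy data set and the direction $\mathbf{v}$ are made up for illustration, and any unit vector could be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: M = 500 points in 2D, elongated roughly along 45 degrees.
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.8],
                                          [0.0, 0.3]])
X -= X.mean(axis=0)                        # zero-mean data, as assumed throughout

v = np.array([1.0, 1.0]) / np.sqrt(2)      # unit vector defining the 1D subspace

y     = X @ v                              # (2): one number per data point
X_par = np.outer(y, v)                     # (1), (3): reconstruction v*y in 2D
E     = np.mean(np.sum((X - X_par) ** 2, axis=1))   # (4), (5)
print("reconstruction error E =", E)
```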

1.3 Reconstruction error and variance

The question now is how we can find the direction of the one-dimensional subspace that minimizes the reconstruction error. For that it is interesting to inspect more closely what happens as we rotate the subspace. Figure 2 illustrates the projection onto two different subspaces. Focus just on the one point $\mathbf{x}$ and its projection $\mathbf{x}_\parallel$. $d$ is the distance of $\mathbf{x}$ from the origin, $r$ is the distance of $\mathbf{x}$ from $\mathbf{x}_\parallel$ in the subspace, and $v$ is the distance of $\mathbf{x}_\parallel$ from the origin. $r$ and $v$ depend on the direction of the subspace while $d$ does not. Interestingly, since the triangles between $\mathbf{x}$, $\mathbf{x}_\parallel$, and the origin are right-angled, $r$ and $v$ are related by Pythagoras' theorem, i.e. $r^2 + v^2 = d^2$. We know that $r^2$ contributes to the reconstruction error; $v^2$, on the other hand, contributes to the variance of the projected data within the subspace. Thus we see that the sum of the reconstruction error and the variance of the projected data is constant and equals the variance of the original data. Therefore, minimizing the reconstruction error is equivalent to maximizing the variance of the projected data.

Figure 2: Variance of the projected data and reconstruction error as the linear subspace is rotated. © CC BY-SA 4.0
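A quick numerical check of this relation might look as follows (again with made-up toy data): for every direction of the one-dimensional subspace, the reconstruction error and the variance of the projected data sum up to the total variance.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2)) @ np.array([[1.5, 0.0],
                                           [0.9, 0.4]])
X -= X.mean(axis=0)

total_var = np.mean(np.sum(X ** 2, axis=1))          # variance of the original data

for angle in np.linspace(0.0, np.pi, 7):
    v = np.array([np.cos(angle), np.sin(angle)])     # direction of the 1D subspace
    y = X @ v                                        # projected data
    var_y = np.mean(y ** 2)                          # variance within the subspace
    E = np.mean(np.sum((X - np.outer(y, v)) ** 2, axis=1))   # reconstruction error
    # E + var_y equals total_var for every direction (Pythagoras).
    print(f"angle={angle:4.2f}  E+var_y={E + var_y:7.4f}  total={total_var:7.4f}")
```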

1.4 Covariance matrix

How can we determine the direction of maximal variance? The first thing we can do is to determine the variances of the individual components. If the data points (or vectors) are written as $\mathbf{x} = (x_1, x_2)^T$ ($^T$ indicates transpose), then the variances of the first and second component can be written as $C_{11} := \langle x_1 x_1 \rangle$ and $C_{22} := \langle x_2 x_2 \rangle$, where angle brackets indicate averaging over all data points. Please remember that these are strictly speaking 2nd moments and not variances, but since we assume zero-mean data that does not make a difference. If $C_{11}$ is large compared to $C_{22}$, then the direction of maximal variance is close to $(1, 0)^T$, while if $C_{11}$ is small, the direction of maximal variance is close to $(0, 1)^T$. (Notice that variance does not have a polarity, so that one could use the inverse vector $(-1, 0)^T$ instead of $(1, 0)^T$ equally well for indicating the direction of maximal variance.)

But what if $C_{11}$ is of similar value as $C_{22}$, like in the example of Figure 1? Then the covariance between the two components, $C_{12} := \langle x_1 x_2 \rangle$, can give us additional information (notice that $C_{21} := \langle x_2 x_1 \rangle$ is equal to $C_{12}$). A large positive value of $C_{12}$ indicates a strong correlation between $x_1$ and $x_2$ and that the data cloud is extended along the $(1, 1)^T$ direction. A negative value would indicate anti-correlation and an extension along the $(-1, 1)^T$ direction. A small value of $C_{12}$ would indicate no correlation and thus little structure of the data, i.e. no prominent direction of maximal variance. The variances and covariances are conveniently arranged in a matrix with components

•   $C_{ij} := \langle x_i x_j \rangle \,,$   (6)

which is called the covariance matrix (remember, assuming zero-mean data)¹. It can easily be shown that the components obey the relation

•   $C_{ij}^2 \le C_{ii} C_{jj} \,.$   (7)

It is also easy to see that scaling the data by a factor $\alpha$ scales the covariance matrix by a factor $\alpha^2$. Figure 3 shows several data clouds and the corresponding covariance matrices.
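In code, the covariance matrix of zero-mean data is just the averaged outer product of the data points with themselves; a small sketch with made-up data (note that numpy's own np.cov normalizes by $M - 1$ rather than $M$, which does not matter for the argument here):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.7],
                                           [0.0, 0.5]])
X -= X.mean(axis=0)                  # zero mean, so 2nd moments equal variances

M = X.shape[0]
C = X.T @ X / M                      # C_ij = <x_i x_j>, cf. (6) and (39)
print(C)                             # symmetric 2x2 matrix; C[0,1] is the covariance
print(C[0, 1] ** 2 <= C[0, 0] * C[1, 1])   # relation (7) holds
```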

1.5 Covariance matrix and higher order structure

Notice that the covariance matrix only gives you information about the general extent of the data (the second order moments). It does not give you any information about the higher-order structure of the data cloud. Figure 4 shows different data distributions that all have the same covariance matrix.

¹ Important text (but not inline formulas) is set bold face; • marks important formulas worth remembering; ◦ marks less important formulas, which I also discuss in the lecture; + marks sections that I typically skip during my lectures.


Figure 3: Several data distributions and their covariance matrices. © CC BY-SA 4.0

Thus, as long as we consider only the covariance matrix, i.e. second order moments, we can always assume a Gaussian data distribution with an ellipsoid shape, because the covariance matrix does not represent any more structure in any case.

Figure 4: Different data distributions with identical covariance matrices. © CC BY-SA 4.0

1.6 PCA by diagonalizing the covariance matrix

Now that we have learned that the covariance matrix in principle contains the information about the direction of maximal variance, the question arises how we can get at this information. From Figure 3 (a) and (b) we can see that there are two fundamentally different situations: in (a) the data cloud is aligned with the axes of the coordinate system and the covariance matrix is diagonal; in (b) the data cloud is oblique to the axes and the matrix is not diagonal. In the former case the direction of maximal variance is simply the axis belonging to the largest value on the diagonal of the covariance matrix. In the latter case, we cannot directly say what the direction of maximal variance might be. Thus, since the case of a diagonal covariance matrix is so much simpler, the strategy we are going to take is to make a non-diagonal covariance matrix diagonal by rotating the coordinate system accordingly. This is illustrated in Figure 5. From linear algebra we know that diagonalizing a matrix can be done by solving the corresponding eigenvalue equation. It will turn out that the eigenvectors of the covariance matrix point into the directions of maximal (and minimal) variance and that the eigenvalues are equal to the variances along these directions. Projecting the data onto the eigenvectors with largest eigenvalues is therefore the optimal linear dimensionality reduction.
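As a preview of the formalism below, a few lines of numpy suffice to diagonalize a covariance matrix and read off the direction of maximal variance; the data are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2)) @ np.array([[1.0, 0.7],
                                           [0.0, 0.5]])
X -= X.mean(axis=0)

C = X.T @ X / X.shape[0]                 # covariance matrix of the zero-mean data
lam, U = np.linalg.eigh(C)               # eigh is for symmetric matrices, ascending order
order = np.argsort(lam)[::-1]            # largest variance first
lam, U = lam[order], U[:, order]

print("variances along the principal directions:", lam)
print("direction of maximal variance:", U[:, 0])
print(np.round(U.T @ C @ U, 6))          # diagonal covariance matrix in the rotated system
```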

2 Formalism

2.1 Definition of the PCA-optimization problem

The problem of principal component analysis (PCA) can be formally stated as follows.


Figure 5: Diagonalizing the covariance matrix by rotating the coordinate system. © CC BY-SA 4.0

Principal Component Analysis (PCA): Given a set $\{\mathbf{x}^\mu : \mu = 1, \ldots, M\}$ of $I$-dimensional data points $\mathbf{x}^\mu = (x_1^\mu, x_2^\mu, \ldots, x_I^\mu)^T$ with zero mean, $\langle \mathbf{x}^\mu \rangle_\mu = \mathbf{0}_I$, find an orthogonal matrix $\mathbf{U}$ with determinant $|\mathbf{U}| = +1$ generating the transformed data points $\mathbf{x}'^\mu := \mathbf{U}^T\mathbf{x}^\mu$ such that for any given dimensionality $P$ the data projected onto the first $P$ axes, $\mathbf{x}'^\mu_\parallel := (x_1'^\mu, x_2'^\mu, \ldots, x_P'^\mu, 0, \ldots, 0)^T$, have the smallest reconstruction error

  $E := \langle \|\mathbf{x}'^\mu - \mathbf{x}'^\mu_\parallel\|^2 \rangle_\mu$   (8)

among all possible projections onto a $P$-dimensional subspace. The column vectors of matrix $\mathbf{U}$ (equivalently, the row vectors of $\mathbf{U}^T$) define the new axes and are called the principal components.

Some remarks: (i) $\langle \mathbf{x}^\mu \rangle_\mu$ indicates the mean over all $M$ data points indexed with $\mu$. To simplify the notation we will from now on drop the index $\mu$ and indicate averages over the data points by $\langle\cdot\rangle$. (ii) If one has non-zero-mean data, one typically removes the mean before applying PCA. Even though all the math is valid also for non-zero-mean data, the results would typically be undesired and nonintuitive. (iii) Since matrix $\mathbf{U}$ is orthogonal and has determinant value $+1$, it corresponds simply to a rotation of the data $\mathbf{x}$. Thus, the 'shape' of the data cloud remains the same, just the 'perspective' changes. $|\mathbf{U}| = -1$ would imply a mirror reflection of the data distribution and is often permitted, too. Note also that one can interpret the multiplication with matrix $\mathbf{U}^T$ either as a rotation of the data or as a rotation of the coordinate system. Either interpretation is valid. (iv) Projecting the data $\mathbf{x}'$ onto the $P$-dimensional linear subspace spanned by the first $P$ axes is simply done by setting all components higher than $P$ to zero. This can be done because we still have an orthonormal coordinate system. If $\mathbf{U}$ and therefore the new coordinate system were not orthogonal, then the projection would become a mathematically more complex operation. (v) The reconstruction error has to be minimal for any $P$. This has the advantage that we do not need to decide on $P$ before performing PCA. Often $P$ is actually chosen based on information obtained during PCA and governed by a constraint, such as that the reconstruction error should be below a certain threshold.

2.2 Matrix $\mathbf{V}^T$: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system

Assume some data points $\mathbf{x}$ are given in an $I$-dimensional space and a linear subspace is spanned by $P$ orthonormal vectors

◦   $\mathbf{v}_p := (v_{1p}, v_{2p}, \ldots, v_{Ip})^T$   (9)

◦   with $\mathbf{v}_p^T \mathbf{v}_q = \delta_{pq} := \begin{cases} 1 & \text{if } p = q \\ 0 & \text{otherwise} \end{cases} \,.$   (10)

We will typically assume $P < I$ and speak of a high ($I$)-dimensional space and a low ($P$)-dimensional (sub)space. However, $P = I$ may be possible as a limiting case as well.


Arranging these vectors in a matrix yields

•   $\mathbf{V} := (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_P)$   (11)
    $\overset{(9)}{=} \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1P} \\ v_{21} & v_{22} & \cdots & v_{2P} \\ \vdots & & \ddots & \vdots \\ v_{I1} & v_{I2} & \cdots & v_{IP} \end{pmatrix} \,.$   (12)

This matrix can be used to map the data points $\mathbf{x}$ into the subspace spanned by the vectors $\mathbf{v}_p$, yielding

◦   $\mathbf{y} := \mathbf{V}^T\mathbf{x} \,,$   (13)

see also Figure 6. If $P < I$ then the dimensionality is reduced and some information is lost; if $P = I$ all information is preserved.

Figure 6: The effect of matrices $\mathbf{V}^T$ and $\mathbf{V}$ and combinations thereof for an example of a mapping from 2D to 1D. © CC BY-SA 4.0

In any case the mapped data are now represented in a new coordinate system, the axes of which are given by the vectors $\mathbf{v}_p$. With $P = 2$ and $I = 3$, for example, we have

  $\mathbf{y} = \begin{pmatrix} \mathbf{v}_1^T\mathbf{x} \\ \mathbf{v}_2^T\mathbf{x} \end{pmatrix} = \begin{pmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \end{pmatrix} \mathbf{x} = \mathbf{V}^T\mathbf{x}$   or   $\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} v_{11} & v_{21} & v_{31} \\ v_{12} & v_{22} & v_{32} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \mathbf{V}^T\mathbf{x} \,.$

Note that $\mathbf{y}$ is $P$-dimensional while $\mathbf{x}$ is $I$-dimensional.

It is important to realize that we have done two things here: firstly, we have moved the points from the high-dimensional space onto the low-dimensional subspace (the points that were already in the subspace have not been moved, of course), and secondly, we have represented the moved points in a new coordinate system that is particularly suitable for the low-dimensional subspace. Thus, we went from the high-dimensional space and the old coordinate system to the low-dimensional subspace and a new coordinate system. Note also that points in the high-dimensional space can generally not be represented accurately in the new coordinate system, because it does not have enough dimensions.
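A small numerical illustration of these two mappings, with $I = 3$, $P = 2$ and two hand-picked orthonormal vectors (the numbers are made up):

```python
import numpy as np

# Two orthonormal vectors spanning a 2D subspace of the 3D data space.
v1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
v2 = np.array([0.0, 0.0, 1.0])
V = np.column_stack([v1, v2])        # I x P = 3 x 2 matrix, cf. (11), (12)

x = np.array([2.0, 0.0, 1.0])        # a point in the old 3D coordinate system
y = V.T @ x                          # (13): its coordinates in the new 2D system
x_par = V @ y                        # (14), (15): back in the old system, on the subspace
print("y =", y, "  x_par =", x_par)
print(np.allclose(V.T @ V, np.eye(2)))   # (17): V^T V is the 2x2 identity
```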

2.3 Matrix $\mathbf{V}$: Mapping from low-dimensional new coordinate system to subspace in old coordinate system

Interestingly, since the vectors $\mathbf{v}_p$ are orthonormal, matrix $\mathbf{V}$ can also be used to transform the points back from the new to the old coordinate system, although the lost dimensions cannot be recovered, of course. Thus the mapped points $\mathbf{y}$ in the new coordinate system become points $\mathbf{x}_\parallel$ in the old coordinate system and are given by

◦   $\mathbf{x}_\parallel := \mathbf{V}\mathbf{y}$   (14)
◦   $\quad\;\; \overset{(13)}{=} \mathbf{V}\mathbf{V}^T\mathbf{x} \,.$   (15)


$\mathbf{y}$ and $\mathbf{x}_\parallel$ are equivalent representations, i.e. they contain the same information, just in different coordinate systems.

2.4 Matrix ($\mathbf{V}^T\mathbf{V}$): Identity mapping within new coordinate system

Before we look at the combined matrix $\mathbf{V}\mathbf{V}^T$, consider $\mathbf{V}^T\mathbf{V}$. The latter is obviously a $P \times P$-matrix and performs a transformation from the new (low-dimensional) coordinate system to the old (high-dimensional) coordinate system (14) and back again (13). The back-transformation implies a dimensionality reduction, but since all points in the old coordinate system come from the new coordinate system and therefore lie within the low-dimensional subspace already, the mapping onto the low-dimensional space does not discard any information. Thus, only the back and forth (or rather forth and back) transformation between the two coordinate systems remains, and that in combination is without any effect either. This means that $\mathbf{V}^T\mathbf{V}$ is the identity matrix, which can be easily verified:

  $\left(\mathbf{V}^T\mathbf{V}\right)_{pq} = \mathbf{v}_p^T\mathbf{v}_q \overset{(10)}{=} \delta_{pq}$   (16)
  $\Longleftrightarrow \quad \mathbf{V}^T\mathbf{V} = \mathbf{1}_P$   (17)

with $\mathbf{1}_P$ indicating the identity matrix of dimensionality $P$. With $P = 2$, for example, we have

  $\mathbf{V}^T\mathbf{V} = \begin{pmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \end{pmatrix} (\mathbf{v}_1\; \mathbf{v}_2) = \begin{pmatrix} \mathbf{v}_1^T\mathbf{v}_1 & \mathbf{v}_1^T\mathbf{v}_2 \\ \mathbf{v}_2^T\mathbf{v}_1 & \mathbf{v}_2^T\mathbf{v}_2 \end{pmatrix} \overset{(10)}{=} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \,.$

2.5 Matrix ($\mathbf{V}\mathbf{V}^T$): Projection from high- to low-dimensional (sub)space within old coordinate system

As we have seen above (15), the combined matrix $\mathbf{V}\mathbf{V}^T$ maps the points $\mathbf{x}$ onto the low-dimensional subspace, but in contrast to matrix $\mathbf{V}^T$ alone, the mapped points are represented within the old coordinate system and not the new one. It turns out that this is a projection operation with the characterizing property that it does not make a difference whether you apply it once or twice, i.e. $\mathbf{P}\mathbf{P} = \mathbf{P}$. Let us therefore define the projection matrix

•   $\mathbf{P} := \mathbf{V}\mathbf{V}^T$   (18)

and verify that

◦   $\mathbf{P}\mathbf{P} \overset{(18)}{=} \mathbf{V} \underbrace{\mathbf{V}^T\mathbf{V}}_{\mathbf{1}_P} \mathbf{V}^T$   (19)
◦   $\quad\;\; \overset{(17)}{=} \mathbf{V}\mathbf{V}^T$   (20)
◦   $\quad\;\; \overset{(18)}{=} \mathbf{P} \,.$   (21)


A closer look at $\mathbf{P}$ shows that

◦   $\mathbf{P} := \mathbf{V}\mathbf{V}^T$   (22)
◦   $\quad = (\mathbf{v}_1, \ldots, \mathbf{v}_P) \begin{pmatrix} \mathbf{v}_1^T \\ \vdots \\ \mathbf{v}_P^T \end{pmatrix}$   (23)
    $\quad \overset{(9)}{=} \begin{pmatrix} v_{11} & v_{12} & \cdots \\ v_{21} & \ddots & \\ \vdots & & v_{IP} \end{pmatrix} \begin{pmatrix} v_{11} & v_{21} & \cdots \\ v_{12} & \ddots & \\ \vdots & & v_{IP} \end{pmatrix}$   (24)
    $\quad = \begin{pmatrix} \sum_p v_{1p}v_{1p} & \sum_p v_{1p}v_{2p} & \cdots \\ \sum_p v_{2p}v_{1p} & \ddots & \\ \vdots & & \sum_p v_{Ip}v_{Ip} \end{pmatrix}$   (25)
    $\quad = \sum_{p=1}^{P} \begin{pmatrix} v_{1p}v_{1p} & v_{1p}v_{2p} & \cdots \\ v_{2p}v_{1p} & \ddots & \\ \vdots & & v_{Ip}v_{Ip} \end{pmatrix}$   (26)
◦   $\quad = \sum_{p=1}^{P} \mathbf{v}_p\mathbf{v}_p^T \,.$   (27)

$\mathbf{P}$ is obviously an $I \times I$-matrix. If $P = I$ then projecting from the old to the new and back to the old coordinate system causes no information loss and $\mathbf{P} = \mathbf{1}_I$. The smaller $P$, the more information is lost and the more does $\mathbf{P}$ differ from the identity matrix. Consider, for example,

  $\mathbf{v}_1 := \tfrac{1}{2}(\sqrt{2}, -1, 1)^T \;\Rightarrow\; \mathbf{v}_1\mathbf{v}_1^T = \tfrac{1}{4}\begin{pmatrix} 2 & -\sqrt{2} & \sqrt{2} \\ -\sqrt{2} & 1 & -1 \\ \sqrt{2} & -1 & 1 \end{pmatrix} \,,$

  $\mathbf{v}_2 := \tfrac{1}{2}(0, \sqrt{2}, \sqrt{2})^T \;\Rightarrow\; \mathbf{v}_2\mathbf{v}_2^T = \tfrac{1}{4}\begin{pmatrix} 0 & 0 & 0 \\ 0 & 2 & 2 \\ 0 & 2 & 2 \end{pmatrix} \,,$   and

  $\mathbf{v}_3 := \tfrac{1}{2}(-\sqrt{2}, -1, 1)^T \;\Rightarrow\; \mathbf{v}_3\mathbf{v}_3^T = \tfrac{1}{4}\begin{pmatrix} 2 & \sqrt{2} & -\sqrt{2} \\ \sqrt{2} & 1 & -1 \\ -\sqrt{2} & -1 & 1 \end{pmatrix} \,,$

for which you can easily verify that $\mathbf{P}$ (27) successively becomes the identity matrix as you take more of the $\mathbf{v}_p\mathbf{v}_p^T$-terms.
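This little example is easy to check numerically; the following sketch also verifies the projection property $\mathbf{P}\mathbf{P} = \mathbf{P}$ at every stage:

```python
import numpy as np

s = np.sqrt(2.0)
v1 = np.array([  s, -1.0, 1.0]) / 2
v2 = np.array([0.0,    s,   s]) / 2
v3 = np.array([ -s, -1.0, 1.0]) / 2

P = np.zeros((3, 3))
for v in (v1, v2, v3):
    P += np.outer(v, v)                          # add one v_p v_p^T term, cf. (27)
    print(np.round(P, 3))                        # approaches the 3x3 identity matrix
    print("PP = P:", np.allclose(P @ P, P))      # projection property (19)-(21)
```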

2.6 Variance

The variance of a multi-dimensional data set is defined as the sum over the variances of its components. Since we assume zero-mean data, we have

•   $\operatorname{var}(\mathbf{x}) := \sum_{i=1}^{I} \langle x_i^2 \rangle$   (28)
◦   $\quad\quad\;\; = \left\langle \sum_{i=1}^{I} x_i^2 \right\rangle$   (29)
•   $\quad\quad\;\; = \langle \mathbf{x}^T\mathbf{x} \rangle \,.$   (30)

This also holds for the projected data, of course, $\operatorname{var}(\mathbf{y}) = \langle \mathbf{y}^T\mathbf{y} \rangle$.


2.7 Reconstruction error

The reconstruction error $E$ is defined as the mean square sum over the distances between the original data points $\mathbf{x}$ and the projected ones $\mathbf{x}_\parallel$. If we define the orthogonal vectors (D: Lotvektoren)

◦   $\mathbf{x}_\perp = \mathbf{x} - \mathbf{x}_\parallel$   (31)

(in contrast to the projected vectors $\mathbf{x}_\parallel$) we can write the reconstruction error as the variance of the orthogonal vectors and find

•|◦   $E \overset{(8,31)}{=} \langle \mathbf{x}_\perp^T\mathbf{x}_\perp \rangle$   (32)
      $\quad \overset{(31)}{=} \langle (\mathbf{x} - \mathbf{x}_\parallel)^T (\mathbf{x} - \mathbf{x}_\parallel) \rangle$   (33)
◦     $\quad \overset{(15)}{=} \langle (\mathbf{x} - \mathbf{V}\mathbf{V}^T\mathbf{x})^T (\mathbf{x} - \mathbf{V}\mathbf{V}^T\mathbf{x}) \rangle$   (34)
◦     $\quad = \langle \mathbf{x}^T\mathbf{x} - 2\,\mathbf{x}^T\mathbf{V}\mathbf{V}^T\mathbf{x} + \mathbf{x}^T\mathbf{V}\underbrace{(\mathbf{V}^T\mathbf{V})}_{=\mathbf{1}_P}\mathbf{V}^T\mathbf{x} \rangle$   (35)
◦     $\quad \overset{(17)}{=} \langle \mathbf{x}^T\mathbf{x} \rangle - \langle \mathbf{x}^T\mathbf{V}\underbrace{(\mathbf{V}^T\mathbf{V})}_{=\mathbf{1}_P}\mathbf{V}^T\mathbf{x} \rangle$   (36)
◦     $\quad \overset{(15)}{=} \langle \mathbf{x}^T\mathbf{x} \rangle - \langle \mathbf{x}_\parallel^T\mathbf{x}_\parallel \rangle$   (37)
•     $\quad \overset{(36,13)}{=} \langle \mathbf{x}^T\mathbf{x} \rangle - \langle \mathbf{y}^T\mathbf{y} \rangle \,.$   (38)

This means that the reconstruction error equals the difference between the variance of the data and the variance of the projected data. Thus, this verifies our intuition that minimizing the reconstruction error is equivalent to maximizing the variance of the projected data.
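This identity is easy to verify numerically for any orthonormal projection matrix $\mathbf{V}$; the data and the subspace below are chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # toy data, one point per row
X -= X.mean(axis=0)

# Any orthonormal basis of a 2D subspace will do; take a random one via QR.
V, _ = np.linalg.qr(rng.normal(size=(5, 2)))    # I x P matrix with orthonormal columns

Y = X @ V                                       # projected data, (13)
X_par = Y @ V.T                                 # reconstruction, (15)
E = np.mean(np.sum((X - X_par) ** 2, axis=1))   # reconstruction error, (8)
var_x = np.mean(np.sum(X ** 2, axis=1))
var_y = np.mean(np.sum(Y ** 2, axis=1))
print(np.isclose(E, var_x - var_y))             # verifies (38)
```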

2.8 Covariance matrix

We have already argued heuristically that the covariance matrix $\mathbf{C}_x$ with $C_{x\,ij} := \langle x_i x_j \rangle$ plays an important role in performing PCA. It is convenient to write the covariance matrix in vector notation:

•   $\mathbf{C}_x := \langle \mathbf{x}\mathbf{x}^T \rangle = \frac{1}{M} \sum_\mu \mathbf{x}^\mu\mathbf{x}^{\mu\,T} \,.$   (39)

It is an easy exercise to show that this definition is equivalent to the componentwise one given above. Since $(\mathbf{x}\mathbf{x}^T)^T = \mathbf{x}\mathbf{x}^T$ (remember $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$), one can also see that $\mathbf{C}_x$ is symmetric, i.e. $\mathbf{C}_x^T = \mathbf{C}_x$.

Keep in mind that $\mathbf{C}_x$ is strictly speaking a 2nd-moment matrix and only identical to the covariance matrix since we assume zero-mean data.

2.9 Eigenvalue equation of the covariance matrix

Since the covariance matrix is symmetric, its eigenvalues are real and a set of orthogonal eigenvectors always exists. In mathematical terms, for a given covariance matrix $\mathbf{C}_x$ we can always find a complete set of real eigenvalues $\lambda_i$ and corresponding eigenvectors $\mathbf{u}_i$ such that

◦   $\mathbf{C}_x\mathbf{u}_i = \mathbf{u}_i\lambda_i$   (eigenvalue equation),   (40)
•   $\lambda_i \ge \lambda_{i+1}$   (eigenvalues are ordered),   (41)
◦   $\mathbf{u}_i^T\mathbf{u}_j = \delta_{ij}$   (eigenvectors are orthonormal).   (42)

If we combine the eigenvectors into an orthogonal matrix $\mathbf{U}$ and the eigenvalues into a diagonal matrix $\mathbf{\Lambda}$,

•   $\mathbf{U} := (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_I) \,,$   (43)
•   $\mathbf{\Lambda} := \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_I) \,,$   (44)


then we can rewrite (42) and (40) as

•   $\mathbf{U}^T\mathbf{U} \overset{(42,43)}{=} \mathbf{1}_I$   (matrix $\mathbf{U}$ is orthogonal),   (45)
◦   $\Longleftrightarrow \; \mathbf{U}\mathbf{U}^T = \mathbf{1}_I$   (since $\mathbf{U}^{-1} = \mathbf{U}^T$ and $\mathbf{U}$ is square),   (46)
•   $\mathbf{C}_x\mathbf{U} \overset{(40,43,44)}{=} \mathbf{U}\mathbf{\Lambda}$   (eigenvalue equation),   (47)
◦   $\overset{(45)}{\Longleftrightarrow} \; \mathbf{U}^T\mathbf{C}_x\mathbf{U} = \mathbf{\Lambda}$   (48)
◦   $\overset{(45,46)}{\Longleftrightarrow} \; \mathbf{C}_x = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T \,.$   (49)
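In numpy the eigenvalue equation of a symmetric matrix is solved by np.linalg.eigh; note that it returns the eigenvalues in ascending order, so they have to be re-ordered to match convention (41). A small sketch with made-up data:

```python
import numpy as np

def eigendecompose(C):
    """Eigenvalues and eigenvectors of a symmetric matrix, ordered as in (41)."""
    lam, U = np.linalg.eigh(C)        # eigh assumes a symmetric matrix; ascending order
    order = np.argsort(lam)[::-1]     # descending: lambda_i >= lambda_{i+1}
    return lam[order], U[:, order]

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))
X -= X.mean(axis=0)
C = X.T @ X / X.shape[0]              # covariance matrix, (39)

lam, U = eigendecompose(C)
print(np.allclose(C @ U, U @ np.diag(lam)))     # eigenvalue equation (47)
print(np.allclose(U.T @ U, np.eye(4)))          # orthogonality (45)
print(np.allclose(U @ np.diag(lam) @ U.T, C))   # decomposition (49)
```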

2.10 Total variance of the data x

Given the eigenvector matrix $\mathbf{U}$ and the eigenvalue matrix $\mathbf{\Lambda}$ it is easy to compute the total variance of the data:

•|◦   $\langle \mathbf{x}^T\mathbf{x} \rangle = \langle \operatorname{tr}(\mathbf{x}^T\mathbf{x}) \rangle$   (since $s = \operatorname{tr}(s)$ for any scalar $s$)   (50)
◦     $\quad = \langle \operatorname{tr}(\mathbf{x}\mathbf{x}^T) \rangle$   (since $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A})$ for any matrices $\mathbf{A}$, $\mathbf{B}$)   (51)
◦     $\quad = \operatorname{tr}(\langle \mathbf{x}\mathbf{x}^T \rangle)$   (since $\operatorname{tr}(\cdot)$ and $\langle\cdot\rangle$ commute)   (52)
◦     $\quad \overset{(39)}{=} \operatorname{tr}(\mathbf{C}_x)$   (53)
◦     $\quad \overset{(46)}{=} \operatorname{tr}(\mathbf{U}\mathbf{U}^T\mathbf{C}_x)$   (54)
◦     $\quad = \operatorname{tr}(\mathbf{U}^T\mathbf{C}_x\mathbf{U})$   (55)
◦     $\quad \overset{(48)}{=} \operatorname{tr}(\mathbf{\Lambda})$   (56)
•     $\quad \overset{(44)}{=} \sum_i \lambda_i \,.$   (57)

Thus the total variance of the data is simply the sum of the eigenvalues of its covariance matrix.

Notice that along the way of this proof we have shown some very general properties. From line (50) to (53) we have shown that the total variance of some multi-dimensional data equals the trace of its covariance matrix. From line (53) to (55) we have shown that the trace remains invariant under any orthogonal transformation of the coordinate system. This implies that the total variance of some multi-dimensional data is invariant under any orthogonal transformation such as a rotation. This is intuitively clear.
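This can again be checked in a few lines (toy data made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 6)) @ rng.normal(size=(6, 6))
X -= X.mean(axis=0)
C = X.T @ X / X.shape[0]

total_var = np.mean(np.sum(X ** 2, axis=1))   # <x^T x>
lam = np.linalg.eigvalsh(C)                   # eigenvalues of the covariance matrix
print(np.isclose(total_var, np.trace(C)))     # (50)-(53)
print(np.isclose(total_var, lam.sum()))       # (57)
```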

2.11 Diagonalizing the covariance matrix

We can now use matrix $\mathbf{U}$ to transform the data such that the covariance matrix becomes diagonal. Define $\mathbf{x}' := \mathbf{U}^T\mathbf{x}$ and denote the new covariance matrix by $\mathbf{C}'_x$. We have

•     $\mathbf{x}' := \mathbf{U}^T\mathbf{x}$   (58)
•|◦   $\mathbf{C}'_x := \langle \mathbf{x}'\mathbf{x}'^T \rangle$   (59)
◦     $\quad\;\; \overset{(58)}{=} \langle (\mathbf{U}^T\mathbf{x})(\mathbf{U}^T\mathbf{x})^T \rangle$   (60)
◦     $\quad\;\; = \mathbf{U}^T\langle \mathbf{x}\mathbf{x}^T \rangle\mathbf{U}$   (61)
◦     $\quad\;\; \overset{(39)}{=} \mathbf{U}^T\mathbf{C}_x\mathbf{U}$   (62)
•     $\quad\;\; \overset{(48)}{=} \mathbf{\Lambda}$   (63)

and find that the transformed data $\mathbf{x}'$ have a diagonal covariance matrix. Working with $\mathbf{x}'$ instead of $\mathbf{x}$ will simplify further analysis without loss of generality.


2.12 Variance of y for a diagonalized covariance matrix

Now that we have the data represented in a coordinate system in which the covariance matrix is diagonal, we can try to answer the question of which $P$-dimensional subspace minimizes the reconstruction error. Our intuition would predict that it is simply the space spanned by the first $P$ eigenvectors. To show this analytically, we take an arbitrary set of $P$ orthonormal vectors $\mathbf{v}'_p$, and with $\mathbf{V}' := (\mathbf{v}'_1, \mathbf{v}'_2, \ldots, \mathbf{v}'_P)$ we compute the variance of $\mathbf{y}$:

◦   $\mathbf{y} := \mathbf{V}'^T\mathbf{x}'$   (64)
◦   $\Longrightarrow \; \langle \mathbf{y}^T\mathbf{y} \rangle \overset{(64)}{=} \langle \mathbf{x}'^T\mathbf{V}'\mathbf{V}'^T\mathbf{x}' \rangle$   (65)
    $\quad = \langle \operatorname{tr}(\mathbf{x}'^T\mathbf{V}'\mathbf{V}'^T\mathbf{x}') \rangle$   (since $s = \operatorname{tr}(s)$ for any scalar $s$)   (66)
    $\quad = \langle \operatorname{tr}(\mathbf{V}'^T\mathbf{x}'\mathbf{x}'^T\mathbf{V}') \rangle$   (since $\operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}) = \operatorname{tr}(\mathbf{B}\mathbf{C}\mathbf{A})$ if defined)   (67)
◦   $\quad \overset{(59)}{=} \operatorname{tr}(\mathbf{V}'^T\mathbf{C}'_x\mathbf{V}')$   (since $\operatorname{tr}(\cdot)$ and $\langle\cdot\rangle$ commute)   (68)
◦   $\quad \overset{(63)}{=} \operatorname{tr}(\mathbf{V}'^T\mathbf{\Lambda}\mathbf{V}')$   (69)
◦   $\quad = \sum_i \lambda_i \sum_p (v'_{ip})^2 \,.$   (as one can work out on a sheet of paper)   (70)

2.13 Constraints of matrix V′

Note that, since the vectors $\mathbf{v}'_p$ are orthonormal, $\mathbf{V}'$ can always be completed to an orthogonal $I \times I$-matrix by adding $I - P$ additional orthonormal vectors. Since we know that an orthogonal matrix has normalized row as well as column vectors, we see that, by taking away the $I - P$ additional column vectors, we are left with the constraints

◦   $\sum_i (v'_{ip})^2 = 1$   (column vectors of $\mathbf{V}'$ have norm one),   (71)
◦   $\Longrightarrow \; \sum_{ip} (v'_{ip})^2 = P$   (square sum over all matrix elements equals $P$),   (72)
◦   $\sum_p (v'_{ip})^2 \le 1$   (row vectors of $\mathbf{V}'$ have norm less than or equal to one).   (73)

Notice that Constraint (72) is a direct consequence of Constraint (71) and does not need to be verified separately in the following considerations.

2.14 Finding the optimal subspace

Since the variance (70) of $\mathbf{y}$ as well as the constraints (71, 72, 73) of matrix $\mathbf{V}'$ are linear in $(v'_{ip})^2$, maximization of the variance $\langle \mathbf{y}^T\mathbf{y} \rangle$ is obviously achieved by putting as much 'weight' as possible on the large eigenvalues, which are the first ones. The simplest way of doing that is to set

◦   $v'_{ip} := \delta_{ip} := \begin{cases} 1 & \text{if } i = p \\ 0 & \text{otherwise} \end{cases} \,,$   (74)

with the Kronecker symbol $\delta_{ip}$.

Since $I \ge P$ we can verify the constraints

  $\sum_p (v'_{ip})^2 \overset{(74)}{=} \sum_p \delta_{ip}^2 = \left\{\begin{array}{ll} 1 & \text{if } i \le P \\ 0 & \text{otherwise} \end{array}\right\} \le 1 \,,$   (75)
  $\sum_i (v'_{ip})^2 \overset{(74)}{=} \sum_i \delta_{ip}^2 = \delta_{pp}^2 = 1 \,,$   (76)

and see from (75) that there is actually as much 'weight' on the first, i.e. large, eigenvalues as Constraint (73) permits.


2.15 Interpretation of the result

What does it mean to set $v'_{ip} := \delta_{ip}$? It means that $\mathbf{V}'$ projects the data $\mathbf{x}'$ onto the first $P$ axes, which in fact is a projection onto the first $P$ eigenvectors of the covariance matrix $\mathbf{C}_x$. Thus, if we define

•|◦   $\mathbf{V} := \mathbf{U}\mathbf{V}'$   (77)
•     $\quad\; \overset{(43,74)}{=} (\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_P)$   (78)

we can go back to the original coordinate system and find

•|◦   $\mathbf{y} \overset{(64)}{=} \mathbf{V}'^T\mathbf{x}'$   (79)
◦     $\quad \overset{(58)}{=} \mathbf{V}'^T\mathbf{U}^T\mathbf{x}$   (80)
•     $\quad \overset{(77)}{=} \mathbf{V}^T\mathbf{x} \,,$   (81)

which we know has maximal variance. Thus, if we start from the original data $\mathbf{x}$ we would set $\mathbf{v}_p := \mathbf{u}_p$.

The variance of $\mathbf{y}$ is

•|◦   $\langle \mathbf{y}^T\mathbf{y} \rangle \overset{(70)}{=} \sum_{i=1}^{I} \lambda_i \sum_{p=1}^{P} (v'_{ip})^2$   (82)
◦     $\quad\quad\;\; \overset{(74)}{=} \sum_{i=1}^{I} \lambda_i \sum_{p=1}^{P} \delta_{ip}^2$   (83)
•     $\quad\quad\;\; = \sum_{i=1}^{P} \lambda_i \,,$   (84)

which is the sum over the first $P$ largest eigenvalues of the covariance matrix. Likewise one can determine the reconstruction error as

•|◦   $E \overset{(38)}{=} \langle \mathbf{x}^T\mathbf{x} \rangle - \langle \mathbf{y}^T\mathbf{y} \rangle$   (85)
◦     $\quad \overset{(57,84)}{=} \sum_{i=1}^{I} \lambda_i - \sum_{j=1}^{P} \lambda_j$   (86)
•     $\quad = \sum_{i=P+1}^{I} \lambda_i \,.$   (87)

Notice that this is just one optimal set of weights. We have seen above that the projected data, like any multi-dimensional data, can be rotated arbitrarily without changing its variance and therefore without changing its reconstruction error. This is equivalent to a rotation of the projection vectors $\mathbf{v}_p$ within the space spanned by the first eigenvectors.

2.16 PCA Algorithm
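A minimal sketch of the resulting algorithm in numpy, assuming the data are given as an $M \times I$ matrix with one data point per row (function and variable names are my own choice): remove the mean, compute the covariance matrix (39), solve its eigenvalue equation (47), order the eigenvalues (41), and project onto the first $P$ eigenvectors (78), (81).

```python
import numpy as np

def pca(X, P):
    """PCA as derived above.  X: M x I data matrix, one data point per row.
    Returns projected data Y (M x P), principal components V (I x P),
    the eigenvalue spectrum lam, and the mean that was removed."""
    mean = X.mean(axis=0)
    Xc = X - mean                        # zero-mean data (Section 1.1)
    C = Xc.T @ Xc / Xc.shape[0]          # covariance matrix, (39)
    lam, U = np.linalg.eigh(C)           # eigenvalue equation, (47)
    order = np.argsort(lam)[::-1]        # order the eigenvalues, (41)
    lam, U = lam[order], U[:, order]
    V = U[:, :P]                         # first P eigenvectors, (78)
    Y = Xc @ V                           # projected data, (81)
    return Y, V, lam, mean

# Usage on made-up 10-dimensional toy data:
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))
Y, V, lam, mean = pca(X, P=2)
X_rec = Y @ V.T + mean                   # reconstruction in the original space, cf. (15)
print("variance captured:", lam[:2].sum(), " reconstruction error:", lam[2:].sum())
```

The variance captured by the projection and the remaining reconstruction error follow directly from the eigenvalue spectrum via (84) and (87).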

2.17 Intuition of the Results

Eigenvalue spectrum.

Projection onto a low-dimensional eigenspace.

Visualization of eigenvectors.


2.18 Whitening or sphering

Sometimes it is desirable to transform a data set such that it has variance one in all directions. Such a normalization operation is called whitening or sphering, see Fig. 7. The latter term is quite intuitive, because a spherical data distribution has the same variance in all directions. Intuitively speaking, sphering requires stretching and compressing the data distribution along the axes of the principal components such that they have variance one. Technically speaking, one first rotates the data into a coordinate system where the covariance matrix is diagonal, then performs the stretching along the axes, and then rotates the data back into the original coordinate system. Principal component analysis obviously gives all the required information. The eigenvectors of the covariance matrix provide the axes of the new coordinate system, and the eigenvalues $\lambda_i$ indicate the variances and therefore how much one has to stretch the data. If the original variance is $\lambda_i$ then one obviously has to stretch by a factor of $1/\sqrt{\lambda_i}$ to get variance one. Thus, sphering is achieved by multiplying the data with a sphering matrix

•   $\mathbf{W} := \mathbf{U}\operatorname{diag}\!\left(\frac{1}{\sqrt{\lambda_1}}, \frac{1}{\sqrt{\lambda_2}}, \ldots, \frac{1}{\sqrt{\lambda_I}}\right)\mathbf{U}^T$   (88)
•   $\hat{\mathbf{x}} := \mathbf{W}\mathbf{x} \,.$   (89)

If the final orientation of the data does not matter, the sphering matrix is often defined without the first $\mathbf{U}$. It is easy to verify that the sphering matrix is symmetric and that the sphered data $\hat{\mathbf{x}}$ have a unit covariance matrix,

  $\mathbf{C}_{\hat{x}} := \langle \hat{\mathbf{x}}\hat{\mathbf{x}}^T \rangle$   (90)
  $\quad\; \overset{(89)}{=} \mathbf{W}\langle \mathbf{x}\mathbf{x}^T \rangle\mathbf{W}^T$   (91)
  $\quad\; \overset{(39,88)}{=} \mathbf{U}\operatorname{diag}\!\left(\tfrac{1}{\sqrt{\lambda_i}}\right)\mathbf{U}^T\, \mathbf{C}_x\, \mathbf{U}\operatorname{diag}\!\left(\tfrac{1}{\sqrt{\lambda_i}}\right)\mathbf{U}^T$   (92)
  $\quad\; \overset{(48)}{=} \mathbf{U}\operatorname{diag}\!\left(\tfrac{1}{\sqrt{\lambda_i}}\right)\mathbf{\Lambda}\operatorname{diag}\!\left(\tfrac{1}{\sqrt{\lambda_i}}\right)\mathbf{U}^T$   (93)
  $\quad\; \overset{(44)}{=} \mathbf{U}\,\mathbf{1}\,\mathbf{U}^T$   (94)
  $\quad\; \overset{(46)}{=} \mathbf{1} \,,$   (95)

and they have variance one in all directions, since for any projection vector $\mathbf{n}$ of norm one the variance $\langle (\mathbf{n}^T\hat{\mathbf{x}})^2 \rangle$ of the projected data is

  $\langle (\mathbf{n}^T\hat{\mathbf{x}})^2 \rangle = \mathbf{n}^T\langle \hat{\mathbf{x}}\hat{\mathbf{x}}^T \rangle\mathbf{n}$   (96)
  $\quad\quad\quad\;\; \overset{(95)}{=} \mathbf{n}^T\mathbf{n}$   (97)
  $\quad\quad\quad\;\; = 1 \,.$   (98)

Similarly one can show that the sphered data projected onto two orthogonal vectors are uncorrelated.
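A sketch of sphering in numpy, under the assumption that all eigenvalues are strictly positive (otherwise $1/\sqrt{\lambda_i}$ is undefined); the toy data are made up:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                           [0.5, 1.0, 0.0],
                                           [0.0, 0.3, 0.2]])
X -= X.mean(axis=0)

C = X.T @ X / X.shape[0]
lam, U = np.linalg.eigh(C)                    # all eigenvalues assumed > 0 here
W = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T     # sphering matrix, (88)
X_white = X @ W.T                             # (89), applied to each data point (row)

C_white = X_white.T @ X_white / X_white.shape[0]
print(np.round(C_white, 6))                   # unit covariance matrix, (90)-(95)
```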

2.19 Singular value decomposition +

Sometimes one has fewer data points than dimensions. For instance one might have 100 images with 10000 pixels each. Then doing direct PCA is very inefficient and the following method, known as singular value decomposition (SVD), is helpful.

Let $\mathbf{x}^\mu$, $\mu = 1, \ldots, M$ be the $I$-dimensional data with $M < I$. For convenience we write the data in one $I \times M$-matrix

  $\mathbf{X} := (\mathbf{x}^1, \ldots, \mathbf{x}^M) \,.$   (99)

The second-moment matrix can then be written as

  $\mathbf{C}_1 := \mathbf{X}\mathbf{X}^T/M \,,$   (100)


Figure 7: Whitening of some data. The mean is removed and the data stretched such that it has variance one in all directions. © CC BY-SA 4.0

and its eigenvalue equation and decomposition read

  $\mathbf{C}_1\mathbf{U}_1 = \mathbf{U}_1\mathbf{\Lambda}_1$   (101)
  $\Longleftrightarrow \; \mathbf{C}_1 = \mathbf{U}_1\mathbf{\Lambda}_1\mathbf{U}_1^T \,.$   (102)

The data represented in the coordinate system of the eigenvectors is

  $\mathbf{Y}_1 := \mathbf{U}_1^T\mathbf{X} \,,$   (103)

which is still high-dimensional.

Now interpret the data matrix $\mathbf{X}$ transposed, i.e. swap the data point index for the dimension index. In our example this would correspond to having 10000 data points in a 100-dimensional space, which is, of course, much easier to deal with. We get the same equations as above, just with $\mathbf{X}$ transposed:

  $\mathbf{C}_2 := \mathbf{X}^T\mathbf{X}/I \,,$   (104)
  $\mathbf{C}_2\mathbf{U}_2 = \mathbf{U}_2\mathbf{\Lambda}_2$   (105)
  $\Longleftrightarrow \; \mathbf{C}_2 = \mathbf{U}_2\mathbf{\Lambda}_2\mathbf{U}_2^T \,,$   (106)
  $\mathbf{Y}_2 := \mathbf{U}_2^T\mathbf{X}^T \,.$   (107)

The interesting property of matrix $\mathbf{Y}_2$ now is that its rows are eigenvectors of matrix $\mathbf{C}_1$, as can be shown easily:

  $\mathbf{C}_1\mathbf{Y}_2^T \overset{(100,107)}{=} (\mathbf{X}\mathbf{X}^T/M)(\mathbf{X}\mathbf{U}_2)$   (108)
  $\quad\quad\;\; = \mathbf{X}(\mathbf{X}^T\mathbf{X}/I)\mathbf{U}_2\, I/M$   (109)
  $\quad\quad\;\; \overset{(104)}{=} \mathbf{X}\mathbf{C}_2\mathbf{U}_2\, I/M$   (110)
  $\quad\quad\;\; \overset{(105)}{=} \mathbf{X}\mathbf{U}_2\mathbf{\Lambda}_2\, I/M$   (111)
  $\quad\quad\;\; \overset{(107)}{=} \mathbf{Y}_2^T\mathbf{\Lambda}_2\, I/M \,.$   (112)

The corresponding eigenvalues are the eigenvalues of $\mathbf{C}_2$ scaled by $I/M$.

However, $\mathbf{Y}_2$ yields only $M$ eigenvectors and eigenvalues. The other eigenvalues are all zero, because $M$ data points can only produce $M$ non-zero variance dimensions or, in other words, $M$ data points together with the origin can only span an $M$-dimensional subspace. The missing $(I - M)$ eigenvectors must all be orthogonal to the first $M$ ones and orthogonal to each other, but can otherwise be quite arbitrary, since their eigenvalues are all equal. A Gram-Schmidt orthogonalization procedure can be used to generate them.
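A sketch of this trick in numpy for the example of 100 data points in 10000 dimensions (random data, purely for illustration); only the small $M \times M$ eigenvalue problem is solved explicitly:

```python
import numpy as np

rng = np.random.default_rng(9)
M, I = 100, 10000                        # e.g. 100 images with 10000 pixels each
X = rng.normal(size=(I, M))              # data matrix as in (99), one data point per column
X -= X.mean(axis=1, keepdims=True)       # remove the mean data point

# Solve the small M x M eigenvalue problem instead of the huge I x I one.
C2 = X.T @ X / I                         # (104)
lam2, U2 = np.linalg.eigh(C2)            # (105)
order = np.argsort(lam2)[::-1]
lam2, U2 = lam2[order], U2[:, order]

U1 = X @ U2                              # columns are eigenvectors of C1 = X X^T / M, (108)-(112)
U1 /= np.linalg.norm(U1, axis=0)         # normalize them
lam1 = lam2 * I / M                      # corresponding eigenvalues of C1
# Note: after mean removal the smallest eigenvalue is ~0 and its direction is arbitrary.

# Check the eigenvalue equation for the leading eigenvector without forming C1 explicitly:
print(np.allclose(X @ (X.T @ U1[:, 0]) / M, lam1[0] * U1[:, 0]))
```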


3 Application

3.1 Face processing

Eigenfaces

(http://www-white.media.mit.edu/ 2002-12-17, © unclear)

If one carefully shifts and scales many face images such that the eyes are in register, i.e. at identical positions, and performs PCA (or SVD) on them, meaningful eigenvectors can be calculated and are called eigenfaces. These are the principal grey value variations that distinguish faces from each other. The figure shows first the mean and then the eigenfaces ordered in rows. The first eigenface obviously accounts for the mean grey value, the second one for the difference in color between hair and face, the third one for illumination from the side, the fourth and seventh one at least partially for a beard. The higher components become increasingly difficult to interpret. The projection of face images onto the first eigenfaces is a suitable representation for face recognition (cf. Turk and Pentland, 1991).
Figure: (http://www-white.media.mit.edu/ 2002-12-17)¹ © unclear.

Eigenfaces - Texture

(Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 11, © non-free)

In this example many faces were warped to a standard geometry and then PCA was applied to calculate the eigenvectors, which are called eigenfaces. The figure visualizes the first four eigenfaces by varying the average face (middle) along the direction of the eigenvectors by up to ±3 standard deviations. The first eigenface (mode) accounts for overall illumination, the other three for some combination of beard, gender, and facial expression.
Figure: (Lanitis et al., 1995, Fig. 11)² © non-free.


Eigenfaces - Shape

(Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 6, © non-free)

In this example PCA was applied to the geometry of faces. A graph with some standard structure was mapped onto many faces, and the concatenated vector of xy-positions of the graph nodes (not shown in the figure) serves as the data for PCA. Visualization is done relative to the average geometry (middle) by adding the eigenvectors up to ±2 standard deviations. The first three components mainly account for the orientation of the head along the three rotational axes. The fourth component accounts for some variation in width and facial expression.
Figure: (Lanitis et al., 1995, Fig. 6)³ © non-free.


4 Other resources

Numbers in square brackets indicate sections of these lecture notes to which the corresponding item is related.

4.1 Written material

• PCA in Wikipedia: https://en.wikipedia.org/wiki/Principal_component_analysis

• A tutorial by Jonathon Shlens: https://arxiv.org/pdf/1404.1100v1.pdf

• An in-depth book by I. T. Jolliffe: http://wpage.unina.it/cafiero/books/pc.pdf

4.2 Visualizations

• 2D, 3D, and 17D examples of PCA: http://setosa.io/ev/principal-component-analysis/

4.3 Videos

• Abstract conceptual introduction to PCA from Georgia Tech.

1. Part 1: https://www.youtube.com/watch?v=kw9R0nD69OU (4:22) [1.1, 1.2, 1.3]

2. Part 2: https://www.youtube.com/watch?v=_nZUhV-qhZA (5:27) [1.2, 1.3]

3. Part 3: https://www.youtube.com/watch?v=kuzJJgPBrqc (5:01) [1.2, 1.3]

• Introduction to PCA with several practical examples by Rasmus Bro from the University of Copenhagen. Note: In these videos the principal components are called loadings, and the projected values are called scores.

1. Conceptual introduction: https://www.youtube.com/watch?v=UUxIXU_Ob6E (12:32)

– 00:00–00:38 Introduction

– 00:38–07:52 Examples of multi-variate data

– 07:52–12:32 Concept of PCA and bi-plot by means of a simple 3D example

2. Example: Analysis of demographic data of many countries in the world with PCA: https://www.youtube.com/watch?v=26YhtSJi1qc (11:36)

– 00:11–02:26 Outlier detection

– 02:26–05:47 Loadings help interpreting the projected data dimensions

– 05:47–08:11 Exploration after removing outliers

– 08:11–09:40 Using external information as markers

– 09:40–10:36 From plots to numbers

– 10:36–11:36 Summary

3. A bit more in-depth discussion: https://www.youtube.com/watch?v=2s-a62zSWL4 (14:31)

– 00:00–00:50 Introduction and historical remarks

– 00:50–03:18 Basic equation for data generation

– 03:18–06:32 Example: Three-variable data set


* 04:30–05:40 Removing the mean before PCA

* 05:40–06:11 PCA as variations along loadings

* 06:11–06:32 PCA as a rotation and truncation

– 06:32–06:59 Scores are projections of the data onto the loadings

– 06:59–14:26 PCA as finding common profiles and profile variations

* 07:12–11:58 Example 1: Continuous spectral data of sugar samples

* 12:22–14:26 Example 2: Discrete physical and chemical measurements of the same sugar samples

· 12:48–14:26 Scaling the variables before PCA (covariance- vs. correlation-matrix)

4. Continuation of the previous video: https://www.youtube.com/watch?v=sRsdF3rcAJc (08:22)

– 00:05–14:26 PCA as finding common profiles and profile variations (cont.)

* 00:05–04:02 Example 2: Discrete physical and chemical measurements of the same sugar samples (cont.)

* 01:31–02:37 Relating results from Examples 1 and 2

* 02:37–04:02 Bi-plot: Understanding the data by visualizing the loadings

– 04:02–08:18 PCA reviewed

• Introduction to PCA in Geoscience by Matthew E. Clapham from UC Santa Cruz: https://www.youtube.com/watch?v=TSYL-oHx4T0 (15:27)

– 00:00–00:33 Introduction

– 00:33–01:17 Data is often multi-variate / high-dimensional

– 01:17–02:50 Covariation allows one to reduce dimensionality and remove redundancy

– 02:50–03:36 skip (Indirect gradient analysis)

– 03:36–04:44 skip (Types of ordination methods)

– 04:44–06:10 Idea of PCA

– 06:10–06:29 Covariance matrix

– 06:29–09:22 Eigenvectors define a new coordinate system ordered by variance

* 07:53–08:43 Eigenvalues measure the amount of variance along the axes of the new coordinate system

– 09:22–10:42 Loadings

– 10:42–12:17 Covariance- vs. correlation-matrix (scaling the variables before PCA)

– 12:17–13:50 Using the eigenvalue spectrum to decide how many PCs to keep

– 13:50–15:27 When is PCA applicable?

• Three lectures on singular value decomposition and PCA (see also the closely related tutorial by Jonathon Shlens: https://arxiv.org/pdf/1404.1100v1.pdf).

1. Singular value decomposition: https://www.youtube.com/watch?v=EokL7E6o1AE (44:35). Explains singular value decomposition in general, which is quite an interesting topic in linear algebra, but not really necessary here. It is also not entirely obvious how this relates to Section 2.19. I list it here because it belongs to the series.

2. PCA in relatively simple mathematical terms: https://www.youtube.com/watch?v=a9jdQGybYmE (51:12)

– 00:00–15:38 Why would you like to do PCA? [1.1]

– 15:38–23:32 Variance and covariance [1.4]

* 18:18–18:30 The vectors introduced 16:52–17:10 are row vectors, so that $\mathbf{a}\mathbf{b}^T$ is indeed an inner product. I always use column vectors, and then I write the inner product as $\mathbf{a}^T\mathbf{b}$.


* 22:26–22:54 The statement made here is not always true. Variables can be uncorrelated but statistically dependent, a simple example being $\cos(t)$ and $\sin(t)$ for $t \in [0, 2\pi]$. He uses the term statistical independence for what really only is zero covariance or no correlation.

– 23:32–25:06 Relation of variance and covariance to the motivational example

– 25:06–35:49 Covariance matrix (assuming zero mean data) [2.8]

– 35:49–39:44 Diagonalizing the covariance matrix [1.6]

– 39:44–47:20 Diagonalization with eigenvalues and -vectors [2.11]

– 47:20–49:43 Diagonalization with singular value decomposition

– 49:43–51:12 Summary

3. Application of PCA to face recognition: https://www.youtube.com/watch?v=8BTv-KZ2Bh8 (48:02)

4.4 Software

• General list of software for PCA: https://en.wikipedia.org/wiki/Principal_component_analysis#Software.2Fsource_code

• PCA and variants thereof in scikit-learn, a Python library for machine learning: http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca

• Examples using PCA in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

4.5 Exercises

• Analytical exercises by Laurenz Wiskott:
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-ExercisesPublic.pdf
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-SolutionsPublic.pdf

• Python exercises by Laurenz Wiskott:
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonExercisesPublic.zip
https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/PrincipalComponentAnalysis-PythonSolutionsPublic.zip

5 Acknowledgment

I thank Agnieszka Grabska-Barwinska for working out the proof for singular value decomposition.

References

Hertz, J., Krogh, A., and Palmer, R. G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.

Lanitis, A., Taylor, C. J., and Cootes, T. F. (1995). An automatic face identification system using flexible appearance models. Image and Vision Computing, 13(5):393–401.

Turk, M. and Pentland, A. P. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86.


Notes

¹ http://www-white.media.mit.edu/ 2002-12-17, © unclear, http://www-white.media.mit.edu/vismod/demos/facerec/basic.html

² Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 11, © non-free, https://www.researchgate.net/profile/Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_models/links/00463533817d5b2dd9000000.pdf

³ Lanitis, Taylor, et al., 1995, Image and Vision Computing 13(5):393–401, Fig. 6, © non-free, https://www.researchgate.net/profile/Andreas_Lanitis/publication/223308678_Automatic_face_identification_system_using_flexible_appearance_models/links/00463533817d5b2dd9000000.pdf
