
Principal Component Analysis

Principal component analysis is a statistical technique that is used to

analyze the interrelationships among a large number of variables and to

explain these variables in terms of a smaller number of variables, called

principal components, with a minimum loss of information.

Definition 1: Let X = [xi] be any k × 1 random vector. We now define a k × 1 vector Y = [yi], where for each i the ith principal component of X is

yi = βi1 x1 + βi2 x2 + … + βik xk

for some regression coefficients βij. Since each yi is a linear combination of the xj, Y is a random vector.

Now define the k × k coefficient matrix β whose columns are the k × 1 coefficient vectors βi = [βi1, …, βik]T. Thus

yi = βiT X        Y = βT X

For reasons that will become apparent shortly, we view the coefficient vectors βi as the columns of β, so that the ith row of βT is the transpose βiT.

Observation:  Let Σ = [σij] be the k × k population covariance matrix

for X. Then the covariance matrix for Y is given by

ΣY = βT Σ β

i.e. the population variances and covariances of the yi are given by

var(yi) = βiT Σ βi        cov(yi, yj) = βiT Σ βj

Observation: Our objective is to choose values for the regression

coefficients βij so as to maximize var(yi) subject to the constraint that

cov(yi, yj) = 0 for all i ≠ j. We find such coefficients βij using the Spectral

Decomposition Theorem (Theorem 1 of Linear Algebra Background).

Since the covariance matrix is symmetric, by Theorem 1 of Symmetric

Matrices, it follows that

Σ = β D βT

where β is a k × k matrix whose columns are unit eigenvectors β1, …,

βk corresponding to the eigenvalues λ1, …, λk of Σ and D is

the k × k diagonal matrix whose main diagonal consists of λ1, …, λk.

Alternatively, the spectral theorem can be expressed as

Σ = λ1 β1 β1T + λ2 β2 β2T + … + λk βk βkT

Property 1: If λ1 ≥ … ≥ λk are the eigenvalues of Σ with corresponding unit eigenvectors β1, …, βk, then taking the βi as the coefficient vectors of the principal components, i.e. setting

yi = βiT X

solves the maximization problem described above, and furthermore, for all i and j ≠ i

var(yi) = λi        cov(yi, yj) = 0

Proof: The first statement results from Theorem 1 as explained above. Since the column vectors βj are orthonormal, βi · βj = βiT βj = 0 if j ≠ i and βiT βi = 1 if j = i. Thus

cov(yi, yj) = βiT Σ βj = λ1 (βiT β1)(β1T βj) + … + λk (βiT βk)(βkT βj)

which equals λi when j = i and 0 when j ≠ i.
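To make Property 1 concrete, here is a minimal NumPy sketch (ours, not part of the original Excel-based presentation; all names and the simulated data are illustrative) that eigendecomposes a covariance matrix, forms Y = βT X and checks that the covariance matrix of Y is diagonal with the eigenvalues λi on its diagonal:

import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations of a k-dimensional random vector X (rows = variables).
k, n = 4, 10_000
A = rng.normal(size=(k, k))
X = A @ rng.normal(size=(k, n))                  # correlated data, k x n

Sigma = np.cov(X)                                # k x k covariance matrix (stand-in for the population Σ)

# Spectral decomposition Σ = β D βᵀ; eigh returns eigenvalues in ascending order.
lam, beta = np.linalg.eigh(Sigma)
lam, beta = lam[::-1], beta[:, ::-1]             # reorder so that λ1 ≥ … ≥ λk

# Principal components Y = βᵀ X (computed from the centered data)
Y = beta.T @ (X - X.mean(axis=1, keepdims=True))

print(np.round(np.cov(Y), 6))                    # ≈ diag(λ1, …, λk): var(yi) = λi, cov(yi, yj) = 0
print(np.round(lam, 6))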

Property 2:

var(x1) + … + var(xk) = λ1 + … + λk

Proof: By definition of the covariance matrix, the main diagonal of Σ contains the values var(x1), …, var(xk), and so trace(Σ) = var(x1) + … + var(xk). But by Property 1 of Eigenvalues and Eigenvectors, trace(Σ) = λ1 + … + λk.

Observation: Thus the total variance var(x1) + … + var(xk) for X can be expressed as trace(Σ) = λ1 + … + λk, but by Property 1, this is also the total variance for Y. Thus the portion of the total variance (of X or Y) explained by the ith principal component yi is λi/(λ1 + … + λk). Assuming that λ1 ≥ … ≥ λk, the portion of the total variance explained by the first m principal components is therefore

(λ1 + … + λm)/(λ1 + … + λk)

Our goal is to find a reduced number of principal components that can explain most of the total variance, i.e. we seek a value of m that is as low as possible but such that the ratio (λ1 + … + λm)/(λ1 + … + λk) is close to 1.
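As a rough illustration of this ratio (a sketch with assumed data, not the worked example that follows), the NumPy lines below compute the proportion of total variance explained by each eigenvalue and the smallest m whose cumulative proportion reaches a chosen cutoff; the 90% cutoff is an arbitrary illustration:

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(9, 9))
Sigma = A @ A.T / 9                               # any symmetric positive semi-definite matrix will do

lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]    # λ1 ≥ … ≥ λk
explained = lam / lam.sum()                       # λi / (λ1 + … + λk)
cumulative = np.cumsum(explained)

# smallest m for which the first m components explain at least 90% of the total variance
m = int(np.searchsorted(cumulative, 0.90)) + 1
print(np.round(explained, 3), np.round(cumulative, 3), m)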

Observation: Since the population covariance Σ is unknown, we will use the sample covariance matrix S = [sij] as an estimate and proceed as above using S in place of Σ. Recall that S is given by the formula

sij = (1/(n - 1)) Σm (xim - x̄i)(xjm - x̄j), the sum running over m = 1, …, n

where we now consider X = [xij] to be a k × n matrix such that for each i, {xij: 1 ≤ j ≤ n} is a random sample for random variable xi, and x̄i is the sample mean of that random sample. Since the sample covariance matrix is symmetric, there is a similar spectral decomposition

S = B D BT = λ1 B1 B1T + … + λk Bk BkT

where the Bj = [bij] are the unit eigenvectors of S corresponding to the eigenvalues λj of S (actually this is a bit of an abuse of notation since these λj are not the same as the eigenvalues of Σ).

We now use bij as the regression coefficients and so have

yi = BiT X = b1i x1 + … + bki xk

and as above, for all i and j ≠ i

var(yi) = λi        cov(yi, yj) = 0

As before, assuming that λ1 ≥ … ≥ λk, we want to find a value of m so that λ1 + … + λm explains as much of the total variance as possible. In this way we reduce the number of principal components needed to explain most of the variance.
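In code, "using S in place of Σ" looks exactly like the population case. Here is a small NumPy sketch (ours, with made-up data) that computes S from a k × n data matrix and then takes its spectral decomposition:

import numpy as np

rng = np.random.default_rng(2)
k, n = 5, 200
X = rng.normal(size=(k, k)) @ rng.normal(size=(k, n))   # k variables, n observations, correlated

xbar = X.mean(axis=1, keepdims=True)
S = (X - xbar) @ (X - xbar).T / (n - 1)                  # sample covariance matrix, k x k
# equivalently: S = np.cov(X)

lam, B = np.linalg.eigh(S)                               # spectral decomposition S = B D Bᵀ
lam, B = lam[::-1], B[:, ::-1]                           # order the eigenvalues λ1 ≥ … ≥ λk

Y = B.T @ (X - xbar)                                     # sample principal component scores
print(np.allclose(np.cov(Y), np.diag(lam)))              # var(yi) ≈ λi, cov(yi, yj) ≈ 0 -> True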

Example 1: The school system of a major city wanted to determine the

characteristics of a great teacher, and so they asked 120 students to rate

the importance of each of the following 9 criteria using a Likert scale of 1

to 10 with 10 representing that a particular characteristic is extremely

important and 1 representing that the characteristic is not important.

1. Setting high expectations for the students

2. Entertaining

3. Able to communicate effectively

4. Having expertise in their subject

5. Able to motivate

6. Caring

7. Charismatic

8. Having a passion for teaching

9. Friendly and easy-going

Figure 1 shows the scores from the first 10 students in the sample and

Figure 2 shows some descriptive statistics about the entire 120 person

sample.


Figure 1 – Teacher evaluation scores

Figure 2 – Descriptive statistics for teacher evaluations

The sample covariance matrix S is shown in Figure 3 and can be

calculated directly as

=MMULT(TRANSPOSE(B4:J123-B126:J126),B4:J123-B126:J126)/(COUNT(B4:B123)-1)

Here B4:J123 is the range containing all the evaluation scores and

B126:J126 is the range containing the means for each criterion.

Alternatively we can simply use the Real Statistics supplemental function

COV(B4:J123) to produce the same result.
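For readers who prefer code to spreadsheet formulas, the same arithmetic can be sketched in NumPy; the 120 × 9 array of ratings below is a random placeholder standing in for range B4:J123:

import numpy as np

rng = np.random.default_rng(3)
scores = rng.integers(1, 11, size=(120, 9)).astype(float)   # placeholder Likert ratings

means = scores.mean(axis=0)                       # one mean per criterion (like B126:J126)
centered = scores - means
S = centered.T @ centered / (len(scores) - 1)     # same arithmetic as the MMULT formula above
# equivalently: S = np.cov(scores, rowvar=False)
print(S.shape)                                    # (9, 9)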

Figure 3 – Covariance Matrix


In practice, we usually prefer to standardize the sample scores. This gives the nine criteria equal weight and is equivalent to using the correlation matrix. Let R = [rij] where rij is the correlation between xi and xj, i.e.

rij = sij/(si sj)

where si and sj are the sample standard deviations of xi and xj.

The sample correlation matrix R is shown in Figure 4 and can be

calculated directly as

=MMULT(TRANSPOSE((B4:J123-B126:J126)/B127:J127),(B4:J123-B126:J126)/B127:J127)/(COUNT(B4:B123)-1)

Here B127:J127 is the range containing the standard deviations for each

criterion. Alternatively we can simply use the Real Statistics

supplemental function CORR(B4:J123) to produce the same result.
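The corresponding NumPy sketch for the correlation matrix divides the centered scores by their sample standard deviations before forming the cross product (again using a random placeholder for the data):

import numpy as np

rng = np.random.default_rng(3)
scores = rng.integers(1, 11, size=(120, 9)).astype(float)   # placeholder Likert ratings

means = scores.mean(axis=0)
sds = scores.std(axis=0, ddof=1)                  # sample standard deviations (like B127:J127)
standardized = (scores - means) / sds
R = standardized.T @ standardized / (len(scores) - 1)
# equivalently: R = np.corrcoef(scores, rowvar=False)
print(np.allclose(np.diag(R), 1.0))               # the main diagonal is all 1s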

Figure 4 – Correlation Matrix

Note that all the values on the main diagonal are 1, as we would expect

since the variances have been standardized. We next calculate the

eigenvalues and eigenvectors of the correlation matrix using the eVECTORS(M4:U12)

supplemental function, as described in Linear Algebra Background. The

result appears in range M18:U27 of Figure 5.

Figure 5 – Eigenvalues and eigenvectors of the correlation matrix


The first row in Figure 5 contains the eigenvalues for the correlation

matrix in Figure 4. Below each eigenvalue is a corresponding unit

eigenvector. E.g. the largest eigenvalue is λ1= 2.880437. Corresponding

to this eigenvalue is the 9 × 1 column eigenvector B1 whose elements are

0.108673, -0.41156, etc.

As we described above, the coefficients of the eigenvectors serve as the regression coefficients of the 9 principal components. For example, the first principal component can be expressed by

y1 = B1T X′

i.e.

y1 = 0.108673 x′1 - 0.41156 x′2 + …

where X′ = [x′j] is the vector of standardized scores (one term for each of the nine criteria).

Thus for any set of scores (for the xj) you can calculate each of the

corresponding principal components. Keep in mind that you need to

standardize the values of the xj first since this is how the correlation

matrix was obtained. For the first sample (row 4 of Figure 1), we can calculate the nine principal components using the matrix equation Y = BTX′ as shown in Figure 6.

Figure 6 – Calculation of PC1 for first sample

Here B (range AI61:AQ69) is the set of eigenvectors from Figure

5, X (range AS61:AS69) is simply the transpose of row 4 from Figure 1, X′ (range AU61:AU69) standardizes the scores in X (e.g. cell AU61

contains the formula =STANDARDIZE(AS61, B126, B127), referring to

Figure 2) and Y (range AW61:AW69) is calculated by

=MMULT(TRANSPOSE(AI61:AQ69),AU61:AU69). Thus the principal component values corresponding to the first sample are 0.782502 (PC1),

-1.9758 (PC2), etc.
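The whole Figure 6 calculation can be mirrored in a few NumPy lines: standardize one row of scores and premultiply by the transpose of the eigenvector matrix. The data here are random placeholders, so the printed values will not match 0.782502 and -1.9758:

import numpy as np

rng = np.random.default_rng(4)
scores = rng.integers(1, 11, size=(120, 9)).astype(float)   # placeholder ratings

means = scores.mean(axis=0)
sds = scores.std(axis=0, ddof=1)
R = np.corrcoef(scores, rowvar=False)

lam, B = np.linalg.eigh(R)
lam, B = lam[::-1], B[:, ::-1]           # columns of B are unit eigenvectors, λ1 ≥ … ≥ λk

x = scores[0]                            # one sample (like row 4 of Figure 1)
x_std = (x - means) / sds                # like =STANDARDIZE(AS61, B126, B127)
y = B.T @ x_std                          # Y = Bᵀ X′: all nine principal component scores
print(np.round(y, 6))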

As observed previously, the total variance for the nine random variables

is 9 (since the variance was standardized to 1 in the correlation matrix),

which is, as expected, equal to the sum of the nine eigenvalues listed in

Figure 5. In fact, in Figure 7 we list the eigenvalues in decreasing order


and show the percentage of the total variance accounted for by that

eigenvalue.

Figure 7 – Variance accounted for by each eigenvalue

The values in column M are simply the eigenvalues listed in the first row

of Figure 5, with cell M41 containing the formula =SUM(M32:M40) and

producing the value 9 as expected. Each cell in column N contains the

percentage of the variance accounted for by the corresponding

eigenvalue. E.g. cell N32 contains the formula =M32/M41, and so we see

that 32% of the total variance is accounted for by the largest eigenvalue.

Column O simply contains the cumulative weights, and so we see that the first four eigenvalues account for 72.3% of the variance.

Using Excel’s charting capability, we can plot the values in column N of

Figure 7 to obtain a graphical representation, called a scree plot.
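A scree plot takes only a few lines with matplotlib; this sketch reuses the placeholder ratings from the earlier snippets rather than the actual Figure 7 values:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
scores = rng.integers(1, 11, size=(120, 9)).astype(float)    # placeholder ratings
lam = np.sort(np.linalg.eigvalsh(np.corrcoef(scores, rowvar=False)))[::-1]

plt.plot(range(1, 10), lam / lam.sum() * 100, marker="o")    # % of total variance per component
plt.xlabel("Principal component")
plt.ylabel("% of total variance explained")
plt.title("Scree plot")
plt.show()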


Figure 8 – Scree Plot

We decide to retain the first four eigenvalues, which explain 72.3% of the

variance. In section Basic Concepts of Factor Analysis we will explain in

more detail how to determine how many eigenvalues to retain. The

portion of the Figure 5 that refers to these eigenvalues is shown in

Figure 9. Since all the coefficients of PC1 except the Expect value are negative, we first decide to negate all the values. This is not a problem since the negative of a unit eigenvector is also a unit eigenvector.

Figure 9 – Principal component coefficients (Reduced Model)

Those values that are sufficiently large, i.e. the values that show a high

correlation between the principal components and the (standardized)

original variables, are highlighted. We use a threshold of ±0.4 for this

purpose.

This is done by highlighting the range R32:U40, selecting Home > Styles|Conditional Formatting, choosing Highlight Cell Rules > Greater Than and inserting the value .4, and then selecting Home > Styles|Conditional Formatting again, this time choosing Highlight Cell Rules > Less Than and inserting the value -.4.
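The same ±0.4 screening rule is easy to express in NumPy; the 9 × 4 loading table below is a random placeholder for range R32:U40, and flagged entries are wrapped in asterisks instead of being highlighted:

import numpy as np

rng = np.random.default_rng(5)
loadings = rng.uniform(-0.6, 0.6, size=(9, 4))        # placeholder coefficients (like R32:U40)

mask = np.abs(loadings) > 0.4                          # same rule as the conditional formatting
for row, hits in zip(loadings, mask):
    print(["*%.3f*" % v if h else "%.3f" % v for v, h in zip(row, hits)])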

Note that Entertainment, Communications, Charisma and Passion are

highly correlated with PC1, Motivation and Caring are highly correlated

with PC3 and Expertise is highly correlated with PC4. Also Expectation is

highly positively correlated with PC2 while Friendly is negatively

correlated with PC2.


Ideally we would like to see that each variable is highly correlated with

only one principal component. As we can see from Figure 9, this is the case in our example. Usually this is not the case, however, and we will show what to do about this in Basic Concepts of Factor Analysis when we discuss rotation.

In our analysis we retain 4 of the 9 principal components. As noted previously, each of the principal components can be calculated by

yi = BiT X′

i.e. Y = BTX′, where Y is a k × 1 vector of principal components, B is a k × k matrix (whose columns are the unit eigenvectors) and X′ is a k × 1 vector of the standardized scores for the original variables.

If we retain only m principal components, then Y = BTX′ where Y is an m × 1 vector, B is a k × m matrix (consisting of the m unit eigenvectors corresponding to the m largest eigenvalues) and X′ is the k × 1 vector of standardized scores as before. The interesting thing is that if Y is known we can calculate estimates of the standardized values for X using the fact that X′ ≈ BBTX′ = B(BTX′) = BY. (For the full k × k matrix B we have BBT = I exactly, since B is an orthogonal matrix; when only m columns are retained, BBT only approximates the identity, which is why BY is an estimate.) From X′ it is then easy to calculate X.
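Here is a NumPy sketch of that round trip (our own illustration with placeholder data): keep the m = 4 eigenvectors with the largest eigenvalues, compute Y = BTX′ for one sample, and then estimate the sample's original scores from BY:

import numpy as np

rng = np.random.default_rng(6)
scores = rng.integers(1, 11, size=(120, 9)).astype(float)   # placeholder ratings

means = scores.mean(axis=0)
sds = scores.std(axis=0, ddof=1)
R = np.corrcoef(scores, rowvar=False)

lam, B = np.linalg.eigh(R)
lam, B = lam[::-1], B[:, ::-1]

m = 4
B_m = B[:, :m]                           # k x m: eigenvectors of the m largest eigenvalues

x = scores[0]                            # one sample
x_std = (x - means) / sds                # X′
y = B_m.T @ x_std                        # the m retained principal component scores

x_std_hat = B_m @ y                      # X′ ≈ B Y (only an estimate: B_m B_mᵀ is not exactly I)
x_hat = x_std_hat * sds + means          # undo the standardization to estimate the original scores
print(np.round(x, 2))
print(np.round(x_hat, 2))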

Figure 10 – Estimate of original scores using reduced model

In Figure 10 we show how this is done using the four principal components that we calculated from the first sample in Figure 6. B (range AN74:AQ82) is the reduced set of coefficients (Figure 9), Y (range AS74:AS77) are the principal components as calculated in Figure 6, X′ are the estimated standardized values for the first sample (range AU74:AU82) using the formula =MMULT(AN74:AQ82,AS74:AS77) and finally X are the estimated scores in the first sample (range AW74:AW82) using the formula =AU74:AU82*TRANSPOSE(B127:J127)+TRANSPOSE(B126:J126).

As you can see the values for X in Figure 10 are similar, but not exactly

the same as the values for X in Figure 6, demonstrating both the effectiveness and the limitations of the reduced principal

component model (at least for this sample data).

Covariance

From Wikipedia, the free encyclopedia


In probability theory and statistics, covariance is a measure of how much two random variables change together. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the smaller values, i.e., the variables tend to show similar behavior, the covariance is positive.[1] In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, i.e., the variables tend to show opposite behavior, the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.

A distinction must be made between (1) the covariance of two random variables, which is a population parameter that can be seen as a property of the joint probability distribution, and (2) the sample covariance, which serves as an estimated value of the parameter.


Definition

The covariance between two jointly distributed real-valued random variables X and Y with finite second moments is defined as[2]

Cov(X, Y) = E[(X - E[X])(Y - E[Y])]


where E[X] is the expected value of X, also known as the mean of X. By using the linearity property of expectations, this can be simplified to

Cov(X, Y) = E[XY] - E[X] E[Y]

However, when E[XY] ≈ E[X] E[Y], this last equation is prone to catastrophic cancellation when computed with floating point arithmetic and thus should be avoided in computer programs when the data have not been centered before.[3]
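A small NumPy experiment (ours) shows the effect: with data that have a large mean and a tiny spread, the shortcut E[XY] - E[X]E[Y] loses most of its accuracy, while centering the data first does not:

import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=1e6, scale=0.01, size=100_000)     # huge mean, tiny spread
y = x + rng.normal(scale=0.01, size=100_000)

naive = np.mean(x * y) - np.mean(x) * np.mean(y)      # subtracts two nearly equal large numbers
centered = np.mean((x - x.mean()) * (y - y.mean()))   # center first, then average the products

print(naive, centered, np.cov(x, y, bias=True)[0, 1]) # the last two agree; the first is unreliable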

For random vectors X and Y (both of dimension m) the m×m cross covariance matrix (also known as dispersion matrix or variance–covariance matrix,[4] or simply called covariance matrix) is equal to

Cov(X, Y) = E[(X - E[X])(Y - E[Y])T]

where mT is the transpose of the vector (or matrix) m.

The (i,j)-th element of this matrix is equal to the covariance Cov(Xi, Yj) between the i-th scalar component of X and the j-th scalar component of Y. In particular, Cov(Y, X) is the transpose of Cov(X, Y).

For a vector X = [X1, …, Xm]T of m jointly distributed random variables with finite second moments, its covariance matrix is defined as

Σ(X) = Cov(X, X) = E[(X - E[X])(X - E[X])T]

Random variables whose covariance is zero are called uncorrelated.

The units of measurement of the covariance Cov(X, Y) are those of X times those of Y. By contrast, correlation coefficients, which depend on the covariance, are a dimensionless measure of linear dependence. (In fact, correlation coefficients can simply be understood as a normalized version of covariance.)

Properties

Variance is a special case of the covariance when the two variables are identical:

σ(X, X) = σ2(X) = Var(X)

If X, Y, W, and V are real-valued random variables and a, b, c, d are constant ("constant" in this context means non-random), then the following facts are a consequence of the definition of covariance:

σ(X, a) = 0
σ(X, X) = σ2(X)
σ(X, Y) = σ(Y, X)
σ(aX, bY) = ab σ(X, Y)
σ(X + a, Y + b) = σ(X, Y)
σ(aX + bY, cW + dV) = ac σ(X, W) + ad σ(X, V) + bc σ(Y, W) + bd σ(Y, V)


For a sequence X1, ..., Xn of random variables, and constants a1, ..., an, we have

σ2(a1X1 + … + anXn) = Σi ai2 σ2(Xi) + 2 Σi<j ai aj σ(Xi, Xj)

A more general identity for covariance matrices

Let X be a random vector with covariance matrix Σ(X), and let A be a matrix that can act on X. The covariance matrix of the vector AX is

Σ(AX) = A Σ(X) AT

This is a direct result of the linearity of expectation and is useful when applying a linear transformation, such as a whitening transformation, to a vector.
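The identity is easy to verify numerically; in this NumPy sketch (ours, with an arbitrary covariance and an arbitrary matrix A) both sides are computed from the same sample covariance, so they match to floating point precision:

import numpy as np

rng = np.random.default_rng(8)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[4, 1, 0], [1, 3, 1], [0, 1, 2]],
                            size=5_000).T             # 3 x N samples of a random vector

Sigma = np.cov(X)                                     # covariance matrix of X
A = rng.normal(size=(2, 3))                           # any matrix that can act on X

lhs = np.cov(A @ X)                                   # covariance matrix of the transformed vector A X
rhs = A @ Sigma @ A.T                                 # A Σ(X) Aᵀ
print(np.allclose(lhs, rhs))                          # True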

Uncorrelatedness and independence

If X and Y are independent, then their covariance is zero. This follows because under independence,

E[XY] = E[X] E[Y]

and so Cov(X, Y) = E[XY] - E[X] E[Y] = 0.

The converse, however, is not generally true. For example, let X be uniformly distributed in [-1, 1] and let Y = X2. Clearly, X and Y are dependent, but

σ(X, Y) = E[XY] - E[X] E[Y] = E[X3] - 0 · E[X2] = 0

In this case, the relationship between Y and X is non-linear, while correlation and covariance are measures of linear dependence between two variables. This example shows that if two variables are uncorrelated, that does not in general imply that they are independent. However, if two variables are jointly normally distributed (but not if they are merelyindividually normally distributed), uncorrelatedness does imply independence.
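This X, Y = X2 example is easy to reproduce; the sample covariance in the NumPy sketch below (ours) is close to zero even though Y is completely determined by X:

import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-1, 1, size=1_000_000)
y = x ** 2                                    # completely determined by x, hence dependent on it

print(np.cov(x, y)[0, 1])                     # ≈ 0: X and Y are uncorrelated
print(np.corrcoef(x, y)[0, 1])                # the correlation coefficient is also ≈ 0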


Relationship to inner products

Many of the properties of covariance can be extracted elegantly by observing that it satisfies similar properties to those of an inner product:

1. bilinear: for constants a and b and random variables X, Y, Z, σ(aX + bY, Z) = a σ(X, Z) + b σ(Y, Z);

2. symmetric: σ(X, Y) = σ(Y, X);

3. positive semi-definite: σ2(X) = σ(X, X) ≥ 0 for all random variables X, and σ(X, X) = 0 implies that X is a constant random variable (K).

In fact these properties imply that the covariance defines an inner product over the quotient vector space obtained by taking the subspace of random variables with finite second moment and identifying any two that differ by a constant. (This identification turns the positive semi-definiteness above into positive definiteness.) That quotient vector space is isomorphic to the subspace of random variables with finite second moment and mean zero; on that subspace, the covariance is exactly the L2 inner product of real-valued functions on the sample space.

As a result, for random variables with finite variance, the inequality

|σ(X, Y)| ≤ √(σ2(X) σ2(Y))

holds via the Cauchy–Schwarz inequality.

Proof: If σ2(Y) = 0, then the inequality holds trivially. Otherwise, let the random variable

Z = X - (σ(X, Y)/σ2(Y)) Y

Then we have

0 ≤ σ2(Z) = σ2(X) - 2 (σ(X, Y)/σ2(Y)) σ(X, Y) + (σ(X, Y)/σ2(Y))2 σ2(Y) = σ2(X) - σ(X, Y)2/σ2(Y)

and so σ(X, Y)2 ≤ σ2(X) σ2(Y).

Calculating the sample covariance

Main article: Sample mean and sample covariance

The sample covariance of N observations of K variables is the K-by-K matrix Q = [qjk] with the entries

qjk = (1/(N - 1)) Σi (Xij - X̄j)(Xik - X̄k)

where the sum runs over the observations i = 1, …, N, X̄j is the sample mean of variable j, and qjk is an estimate of the covariance between variable j and variable k.

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector X, a row vector whose jth element (j = 1, ..., K) is one of the random variables. The reason the sample covariance matrix has N - 1 in the denominator rather than N is essentially that the population mean E(X) is not known and is replaced by the sample mean X̄. If the population mean E(X) is known, the analogous unbiased estimate is given by

qjk = (1/N) Σi (Xij - E(Xj))(Xik - E(Xk))
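The two estimators differ only in whether the sample mean or the known population mean is subtracted, and in the denominator. Here is a NumPy sketch of both versions (ours, with an arbitrary known mean and covariance):

import numpy as np

rng = np.random.default_rng(10)
N = 500
mu = np.array([1.0, 2.0, 3.0])                             # population mean, assumed known in the second case
cov = np.array([[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]])
data = rng.multivariate_normal(mean=mu, cov=cov, size=N)   # N observations x K variables

centered = data - data.mean(axis=0)
Q_unknown_mean = centered.T @ centered / (N - 1)           # population mean unknown: divide by N - 1

dev = data - mu
Q_known_mean = dev.T @ dev / N                             # population mean known: divide by N

print(np.round(Q_unknown_mean, 2))
print(np.round(Q_known_mean, 2))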

Comments

The covariance is sometimes called a measure of "linear dependence" between the two random variables. That does not mean the same thing as in the context of linear algebra (see linear dependence). When the covariance is normalized, one obtains the correlation coefficient. From it, one can obtain the Pearson coefficient, which gives the goodness of the fit for the best possible linear function describing the relation between the variables. In this sense covariance is a linear gauge of dependence.

Applications

In genetics and molecular biology

Covariance is an important measure in biology. Certain sequences of DNA are conserved more than others among species, and thus to study secondary and


tertiary structures of proteins, or of RNA structures, sequences are compared in closely related species. If sequence changes are found or no changes at all are found in noncoding RNA (such as microRNA), sequences are found to be necessary for common structural motifs, such as an RNA loop.

In financial economics

Covariances play a key role in financial economics, especially in portfolio theory and in the capital asset pricing model. Covariances among various assets' returns are used to determine, under certain assumptions, the relative amounts of different assets that investors should (in a normative analysis) or are predicted to (in a positive analysis) choose to hold in a context of diversification.

In meteorological data assimilation

The covariance matrix is important in estimating the initial conditions required for running weather forecast models.