
Applied Multivariate Analysis

Seppo Pynnönen

Department of Mathematics and Statistics, University of Vaasa, Finland

Spring 2017


Principal Component Analysis

Dimension reduction

Principal Component Analysis (PCA)


The problem in exploratory multivariate data analysis is usually the large number of variables.

Concentrating the variables into a smaller number of new variables is one form of data reduction.

The major tools in this process are principal component analysis (PCA) and exploratory factor analysis (FA).

PCA is a technical transformation, whereas FA is model based.


The aim in PCA is to replace the original variables, $x_1, x_2, \ldots, x_p$, by a few new variables, $y_1, \ldots, y_k$, that are linear combinations of the x-variables, preserve essentially all the information in the x-variables, and are uncorrelated with each other.


More formally:

The first principal component is

$$y_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p, \tag{1}$$

where the coefficients $a_{1j}$ ($j = 1, \ldots, p$) are defined such that

$$\mathrm{var}[y_1] = \max_{(a_{11},\ldots,a_{1p})} \mathrm{var}[a_{11}x_1 + \cdots + a_{1p}x_p] \tag{2}$$

under the restriction (scaling constraint)

$$a_{11}^2 + \cdots + a_{1p}^2 = 1. \tag{3}$$


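The second principal component is

$$y_2 = a_{21}x_1 + a_{22}x_2 + \cdots + a_{2p}x_p \tag{4}$$

with $a_{2j}$ defined such that

$$\mathrm{var}[y_2] = \max_{(a_{21},\ldots,a_{2p})} \mathrm{var}[a_{21}x_1 + \cdots + a_{2p}x_p], \tag{5}$$

$$a_{21}^2 + \cdots + a_{2p}^2 = 1, \tag{6}$$

and

$$\mathrm{cov}[y_1, y_2] = 0. \tag{7}$$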


Altogether there are p principal components, but not all of them are important.

Thus, through the principal components a set of correlated variables is transformed into a set of uncorrelated variables.


Mathematically, the principal components are obtained by solving the eigenvalue problem of the covariance matrix of the x-variables.

The coefficients of the first PC are the elements of the eigenvector corresponding to the largest eigenvalue, the coefficients of the second PC are the elements of the eigenvector of the second largest eigenvalue, and so on.
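Why eigenvectors appear can be seen from a short Lagrange-multiplier argument. The sketch below is not part of the original slides; the vector notation $\mathbf{a}_1 = (a_{11}, \ldots, a_{1p})'$, the symbol $\Sigma$ for the covariance matrix of the x-variables, and the multiplier $\ell$ are assumptions introduced only for this illustration.

```latex
\begin{aligned}
\mathrm{var}[y_1] &= \mathbf{a}_1'\Sigma\,\mathbf{a}_1,
  \qquad \text{subject to } \mathbf{a}_1'\mathbf{a}_1 = 1,\\
L(\mathbf{a}_1,\ell) &= \mathbf{a}_1'\Sigma\,\mathbf{a}_1
  - \ell\,(\mathbf{a}_1'\mathbf{a}_1 - 1),\\
\frac{\partial L}{\partial \mathbf{a}_1} &= 2\Sigma\,\mathbf{a}_1 - 2\ell\,\mathbf{a}_1 = \mathbf{0}
  \;\Longrightarrow\; \Sigma\,\mathbf{a}_1 = \ell\,\mathbf{a}_1,\\
\mathrm{var}[y_1] &= \mathbf{a}_1'\Sigma\,\mathbf{a}_1 = \ell\,\mathbf{a}_1'\mathbf{a}_1 = \ell .
\end{aligned}
```

Any stationary point is therefore an eigenvector, and the variance it attains equals its eigenvalue, so the maximum is reached at the eigenvector of the largest eigenvalue. For the second component, the extra condition $\mathrm{cov}[y_1, y_2] = \mathbf{a}_2'\Sigma\,\mathbf{a}_1 = \ell_1\,\mathbf{a}_2'\mathbf{a}_1 = 0$ (with $\ell_1 > 0$ the largest eigenvalue) forces $\mathbf{a}_2$ to be orthogonal to $\mathbf{a}_1$, which leads to the eigenvector of the second largest eigenvalue.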

Remark 3.1: In practice, the principal components are usually extracted from the correlation matrix rather than the covariance matrix. Correlations are scale free, while covariances are not.

Remark 3.2: The PC solution from a correlation matrix is different from that of a covariance matrix.


Let $\ell_i$ denote the $i$th eigenvalue of the correlation matrix (or covariance matrix) of the x-variables, ordered such that $\ell_1 \ge \ell_2 \ge \cdots \ge \ell_p$. Then

$$\sum_{i=1}^{p} \mathrm{var}[x_i] = \sum_{i=1}^{p} \ell_i \tag{8}$$

and

$$\mathrm{var}[y_i] = \ell_i. \tag{9}$$

Thus, the $i$th component explains

$$100 \times \frac{\ell_i}{\sum_{j=1}^{p} \mathrm{var}[x_j]}\,\% \tag{10}$$

of the total variance of the x-variables.

Remark 3.3: In the case of the correlation matrix, the variables are standardized to unit variance, i.e., $\mathrm{var}[x_j] = 1$ and $\sum_{j=1}^{p} \mathrm{var}[x_j] = p$. Thus the explanatory power of the $i$th component extracted from the correlation matrix is

$$100 \times \frac{\ell_i}{p}\,\%. \tag{11}$$


Assuming that the components are extracted from the correlation matrix, the correlations of the original variable $x_i$ with the component $y_j$ are given by

$$\mathrm{corr}[x_i, y_j] = a_{ji}\sqrt{\ell_j}, \tag{12}$$

and are called loadings.

Thus, the loadings (correlations) are just scaled eigenvector coefficients, but they may be easier to interpret because correlations lie between −1 and 1.

If the variables that correlate highly with a component have something in common, that can be used as the basis for naming the component.

Remark 3.4: If the components are extracted from the covariance matrix, the loadings are

$$\mathrm{corr}[x_i, y_j] = \frac{a_{ji}\sqrt{\ell_j}}{s_i}, \tag{13}$$

where $s_i$ is the standard deviation of $x_i$.


Example 1

Crime rates in the USA in 2005 per 100,000 people by states.

Source: www.fbi.gov

Violent crimes: murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault.

Property crimes: burglary, larceny-theft, and motor vehicle theft.

Using SAS PROC PRINCOMP, the results are:

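/* PROC PRINCOMP analyzes the correlation matrix by default; */
/* the COV option would request the covariance matrix instead. */
proc princomp data = uscrime2005 out = uscrime_components;
  title 'US crime rates per 100,000 population by state';
  var murder rape robbery assault burglary larceny auto;
run;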


US crime rates per 100,000 population by state, year 2005

Simple Statistics

          murder     rape   robbery   assault   burglary    larceny      auto
Mean       5.590   33.163   114.456   265.729    685.671   2273.432   380.417
StD        5.235   11.888    96.970   147.270    234.068    553.822   250.369

Correlation Matrix

           murder     rape   robbery   assault   burglary   larceny     auto
murder     1.0000
rape       -.1131   1.0000
robbery    0.8707   -.0664    1.0000
assault    0.5456   0.4120    0.6354    1.0000
burglary   0.1966   0.3822    0.2207    0.5591     1.0000
larceny    0.0422   0.4099    0.1514    0.4671     0.6769    1.0000
auto       0.5922   0.1587    0.7063    0.5304     0.4176    0.4655   1.0000

Eigenvalues of the Correlation Matrix

     Eigenvalue   Difference   Proportion   Cumulative
1    3.49029832   1.69375898       0.4986       0.4986
2    1.79653934   1.11181912       0.2566       0.7553
3    0.68472022   0.22190874       0.0978       0.8531
4    0.46281148   0.17842339       0.0661       0.9192
5    0.28438809   0.09650985       0.0406       0.9598
6    0.18787824   0.09451393       0.0268       0.9867
7    0.09336431                    0.0133       1.0000
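As a rough cross-check of the first row of the eigenvalue table, the largest eigenvalue and its eigenvector can be approximated by power iteration on the printed correlation matrix. This is only a dependency-free sketch, not the method SAS uses; the function name `power_iteration` and the variable names are assumptions made for this illustration, and the input is the four-decimal correlation matrix, so small rounding differences are expected.

```python
import math

# Lower triangle of the correlation matrix as printed in the SAS output.
# Variable order: murder, rape, robbery, assault, burglary, larceny, auto.
LOWER = [
    [1.0],
    [-0.1131, 1.0],
    [0.8707, -0.0664, 1.0],
    [0.5456, 0.4120, 0.6354, 1.0],
    [0.1966, 0.3822, 0.2207, 0.5591, 1.0],
    [0.0422, 0.4099, 0.1514, 0.4671, 0.6769, 1.0],
    [0.5922, 0.1587, 0.7063, 0.5304, 0.4176, 0.4655, 1.0],
]
P = len(LOWER)
# Mirror the lower triangle to obtain the full symmetric matrix.
R = [[LOWER[max(i, j)][min(i, j)] for j in range(P)] for i in range(P)]


def power_iteration(matrix, iterations=200):
    """Approximate the dominant eigenvalue and its unit-length eigenvector."""
    size = len(matrix)
    v = [1.0 / math.sqrt(size)] * size  # arbitrary unit-length start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # For a unit vector v, the Rayleigh quotient v'Rv estimates the eigenvalue.
    rv = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
    eigenvalue = sum(v[i] * rv[i] for i in range(size))
    return eigenvalue, v


largest_eigenvalue, first_pc = power_iteration(R)
print(round(largest_eigenvalue, 3))
print([round(a, 3) for a in first_pc])
```

The printed eigenvalue should land close to the first entry of the eigenvalue table, and the vector satisfies the scaling constraint (3) by construction. In practice a linear algebra library would return all seven eigenpairs at once.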


Eigenvectors

            Prin1    Prin2    Prin3    Prin4    Prin5    Prin6    Prin7
murder      0.379    -.460    0.113    -.158    0.175    0.615    -.445
rape        0.185    0.513    0.707    0.293    0.221    0.234    0.118
robbery     0.421    -.417    0.080    0.031    -.113    0.027    0.792
assault     0.458    0.059    0.344    -.385    -.465    -.472    -.283
burglary    0.362    0.372    -.329    -.513    0.577    -.077    0.141
larceny     0.329    0.441    -.455    0.185    -.534    0.414    0.006
auto        0.441    -.118    -.217    0.665    0.273    -.407    -.248

The eigenvalues indicate that two (or three) components provide a good summary of the data. Of the total variance, 76% is accounted for by the first two components and 85% by the first three components.
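The percentages quoted here follow directly from the eigenvalues. A minimal sketch of the arithmetic, using the eigenvalues from the SAS output (the variable names are assumptions made for this illustration):

```python
# Eigenvalues of the correlation matrix as reported in the SAS output.
eigenvalues = [3.49029832, 1.79653934, 0.68472022, 0.46281148,
               0.28438809, 0.18787824, 0.09336431]

# For a correlation matrix the total variance equals the number of variables.
total_variance = sum(eigenvalues)

# Share of the total variance explained by each component.
proportions = [ev / total_variance for ev in eigenvalues]

# Running total of the shares.
cumulative = []
running = 0.0
for share in proportions:
    running += share
    cumulative.append(running)

for k, (share, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Prin{k}: {100 * share:5.1f}%  cumulative {100 * cum:5.1f}%")
```

The eigenvalues sum to 7, the number of variables, so each proportion is simply the eigenvalue divided by 7.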


The loadings matrix for the first three components:

Principal component loadings

              Prin1      Prin2      Prin3
murder      0.70725   -0.61753    0.09331
rape        0.34621    0.68746    0.58502
robbery     0.78777   -0.55883    0.06571
assault     0.85622    0.07870    0.28501
burglary    0.67661    0.49837   -0.27241
larceny     0.61551    0.59229   -0.37635
auto        0.82456   -0.15786   -0.17993
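These loadings can be reproduced, up to rounding, by scaling each eigenvector column by the square root of its eigenvalue, as in equation (12). The sketch below uses the three-decimal eigenvectors from the SAS output, so agreement with the table is only to about two or three decimals; the dictionary layout and names are assumptions made for this illustration.

```python
import math

variables = ["murder", "rape", "robbery", "assault", "burglary", "larceny", "auto"]

# First three eigenvalues and eigenvector entries (Prin1..Prin3) from the SAS output.
eigenvalues = [3.49029832, 1.79653934, 0.68472022]
eigenvectors = {
    "murder":   [0.379, -0.460, 0.113],
    "rape":     [0.185, 0.513, 0.707],
    "robbery":  [0.421, -0.417, 0.080],
    "assault":  [0.458, 0.059, 0.344],
    "burglary": [0.362, 0.372, -0.329],
    "larceny":  [0.329, 0.441, -0.455],
    "auto":     [0.441, -0.118, -0.217],
}

# Loading of variable i on component j: eigenvector entry times sqrt(eigenvalue j).
loadings = {
    name: [coef * math.sqrt(ev) for coef, ev in zip(eigenvectors[name], eigenvalues)]
    for name in variables
}

for name in variables:
    print(f"{name:9s}" + "".join(f"{value:9.3f}" for value in loadings[name]))
```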

All loadings for the first component are about the same and fairly high, except for rape. Thus, the first component describes general criminality.

The second component has high positive loadings on rape, larceny, and burglary and high negative loadings on murder and robbery. Thus this component seems to measure the preponderance of property and sexual crime over violent crimes (other than sexual), or vice versa, since the sign of an eigenvector can be changed.

These kinds of components are called bipolar. Here it means that high negative values on the component indicate high violent crime rates.


The third component is not that clear, but high values of the component indicate those states where rape and assault rates are high while property crimes tend to be below average. Conversely, high negative values indicate a high level of property crime.


Number of components

A rule of thumb for deciding the number of meaningful components is to select those for which the eigenvalue is greater than or equal to 1 (e.g., SPSS uses this as an automatic rule). For components extracted from the correlation matrix, an eigenvalue below 1 means the component explains less variance than a single standardized variable.

Another criterion is the so-called Cattell's scree test. The rule is to retain all the eigenvalues (hence, the number of components) in the sharp descent (before the "elbow point") in the plot of the eigenvalues against their ordinal number.

Usually there is a discernible drop (break point) before the eigenvalues start to level off in the plot.
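Both criteria can be illustrated numerically with the eigenvalues from the crime example. The sketch below applies the eigenvalue rule and lists the successive drops that a scree plot would show graphically; it does not automate the choice of the elbow, which remains a visual judgment, and the variable names are assumptions made for this illustration.

```python
# Eigenvalues of the correlation matrix from the crime example.
eigenvalues = [3.49029832, 1.79653934, 0.68472022, 0.46281148,
               0.28438809, 0.18787824, 0.09336431]

# Eigenvalue rule: keep components whose eigenvalue is at least 1.
kaiser_count = sum(1 for ev in eigenvalues if ev >= 1.0)

# Successive drops between neighbouring eigenvalues; the "elbow" of a scree
# plot is where these drops become small.
drops = [a - b for a, b in zip(eigenvalues, eigenvalues[1:])]

print("components retained by the eigenvalue rule:", kaiser_count)
for k, drop in enumerate(drops, start=1):
    print(f"drop from component {k} to {k + 1}: {drop:.3f}")
```

The drops correspond to the Difference column of the SAS eigenvalue table.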


[Figure: Cattell's Scree Plot for the Crime 2005 Data]

The eigenvalue criterion supports two components and the scree test two or three. We have selected three.


Significant coefficients

The loadings (scaled component coefficients) are correlations.

It can be shown that if the population correlation is zero, the sample correlation is asymptotically normally distributed with zero mean and variance $1/(n-1)$, where $n$ is the sample size.

Using this, we can apply the rule that a coefficient is statistically significant if it is more than two standard errors away from zero, where

$$\mathrm{stderr} = \frac{1}{\sqrt{n-1}}. \tag{14}$$


In the crime data $n = 52$, so those coefficients are statistically significant whose loadings are in absolute value larger than

$$\frac{2}{\sqrt{n-1}} = \frac{2}{\sqrt{51}} \approx 0.28. \tag{15}$$

Thus, for the first component all the coefficients are statistically significant, for the second all but assault and auto, and for the third rape and larceny, while assault and burglary are on the borderline.
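The two-standard-error rule is easy to apply mechanically. The following sketch flags each loading from the loadings table against the threshold in equation (15); the dictionary layout and variable names are assumptions made for this illustration.

```python
import math

n = 52  # sample size stated for the crime data
threshold = 2 / math.sqrt(n - 1)  # two standard errors under zero correlation

# Loadings on the first three components, copied from the loadings table.
loadings = {
    "murder":   [0.70725, -0.61753, 0.09331],
    "rape":     [0.34621, 0.68746, 0.58502],
    "robbery":  [0.78777, -0.55883, 0.06571],
    "assault":  [0.85622, 0.07870, 0.28501],
    "burglary": [0.67661, 0.49837, -0.27241],
    "larceny":  [0.61551, 0.59229, -0.37635],
    "auto":     [0.82456, -0.15786, -0.17993],
}

# True where the absolute loading exceeds the threshold.
significant = {
    name: [abs(value) > threshold for value in values]
    for name, values in loadings.items()
}

print(f"threshold = {threshold:.3f}")
for name, flags in significant.items():
    print(f"{name:9s}", flags)
```

On the third component, assault (0.285) and burglary (−0.272) fall within about 0.01 of the threshold on either side, which is why they are best read as borderline rather than as clearly significant or not.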


Recap

The main uses of principal components are as indexes and as new variables in subsequent studies.

PCA is not a statistical model.

It is merely a linear transformation of the original variables to new variables for the purpose of reducing the dimensionality of the problem (concentrating the information).
