principal component analysis, pca, in r · 2016-02-01 · pca is the eigen decomposition of xtx pca...


eNote 2

Principal Component Analysis, PCA, in R


Contents

2 Principal Component Analysis, PCA, in R

2.1 Reading about PCA

2.2 Example: Fisher's Iris Data

2.2.1 Data import

2.2.2 Basic explorative analysis

2.2.3 PCA of Iris data

2.3 Spectral data example: yarn data

2.4 Exercises

2.1 Reading about PCA

You can use the Wehrens book, Chapter 4, pp 43-56:

http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/page/1

and/or (probably better) the Varmuza-book, chapter 3, sections 3.1 - 3.7:

http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2

The two R packages chemometrics and ChemometricsWithR are companions to the two books.

Bro and Smilde (2014): Principal Component Analysis. Analytical Methods (Tutorial Review), 6, 2812. http://pubs.rsc.org/en/content/articlepdf/2014/ay/c3ay41907j


Below, a number of important plots are exemplified as part of the iris example:

1. Variance plots ("scree-type" plots)

2. Scores, loadings and biplots (main plots for interpretation of structure)

3. Explained variances for each variable

4. Validation/diagnostics plots:

(a) Leverage and residuals, also called "score distances" and "orthogonal distances" (cf. the nice Figure 3.15, page 79 in the Varmuza book)

(b) The "influence plot": residuals versus leverage

5. Jackknifing/bootstrapping/cross-validating the PCA for various purposes:

(a) Deciding on number of components

(b) Sensitivity/uncertainty investigation of scores and loadings.

What is PCA: Developed by Karl Pearson in 1901:

Pearson, K. (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine (6) 2: 559-572.


May also be called:

• Singular value decomposition

• Karhunen-Loeve expansion

• Eigenvector analysis

• Latent vector analysis

• Characteristic vector analysis

PCA is used for many things:

• Projection method

• Exploratory data analysis

• Extract information and remove noise

• Reduce dimensionality / Compression

• (Clustering)

And can be described/expressed in many ways:


• Produces optimal low-dimensional plots of observations (scores)

• Provides an overview of the variable correlation structure (loadings)

• Finds linear combinations of maximal variance

• Orthogonal distance regression method

• A bilinear model for the data

And can be described/expressed in many ways:

X: the (centered and scaled) $n \times p$ data matrix

X = Observation Scores × Variable Loadings + Error

$$X = T P^T + E$$

$$X_{ij} = \sum_{a=1}^{A} t_{ia} p_{aj} + e_{ij}$$
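The bilinear model can be made concrete in a few lines of base R. The following is a minimal sketch, assuming the built-in iris data, standardized variables and A = 2 components; the object names (X, T_A, P_A, E) are chosen here for illustration and do not come from the note itself.

```r
# Sketch: X = T P^T + E with A = 2 components (base R only)
X <- scale(iris[, 1:4])            # centered and scaled n x p data matrix
pc <- prcomp(X, center = FALSE)    # X is already centered and scaled
A <- 2
T_A <- pc$x[, 1:A]                 # scores   (n x A)
P_A <- pc$rotation[, 1:A]          # loadings (p x A)
E <- X - T_A %*% t(P_A)            # residual matrix (n x p)

# Share of the total sum of squares left in the residuals,
# i.e. the variation not captured by PC 1 and PC 2 (about 0.04 here)
sum(E^2) / sum(X^2)
```

With A equal to the number of variables the residual matrix E becomes zero up to rounding error, since the model then reproduces X exactly.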

Computations/A bit of math:

• PCA finds the linear combinations of the X-variables (components) with maximal variance:

$$\max_{\|\alpha\| = 1} \operatorname{Var}(X\alpha)$$

• PCA is the least squares fit of the bilinear (non-linear regression) model:

$$\min_{t,p} \sum_{ij} \Big( x_{ij} - \sum_{a=1}^{A} t_{ia} p_{aj} \Big)^2$$

• PCA is the eigendecomposition of $X^T X$

• PCA is the eigendecomposition of $X X^T$

• PCA is the outcome of (a version of) the NIPALS algorithm
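The last three statements can be checked numerically. The sketch below assumes the standardized iris data and looks only at the first component; the NIPALS loop is one simple textbook-style variant with a hypothetical convergence tolerance, not necessarily the version any particular package implements.

```r
# Sketch: three routes to the same first principal component (base R only)
X <- scale(iris[, 1:4])                 # centered and scaled data, n x p
n <- nrow(X)

# 1) Eigendecomposition of X^T X (p x p): the eigenvectors are the loadings
eig <- eigen(crossprod(X), symmetric = TRUE)
p1_eigen <- eig$vectors[, 1]

# 2) Eigendecomposition of X X^T (n x n): the eigenvectors are the scores
#    scaled to length 1; the nonzero eigenvalues equal those of X^T X
eig_n <- eigen(tcrossprod(X), symmetric = TRUE)
t1_eigen <- eig_n$vectors[, 1] * sqrt(eig_n$values[1])

# 3) NIPALS for the first component: alternate two least squares steps
t1 <- X[, 1]                            # start from any column of X
for (iter in 1:500) {
  p1 <- crossprod(X, t1) / sum(t1^2)    # regress the columns of X on t
  p1 <- p1 / sqrt(sum(p1^2))            # normalize the loading to length 1
  t_new <- X %*% p1                     # regress the rows of X on p
  converged <- sum((t_new - t1)^2) < 1e-16 * sum(t_new^2)
  t1 <- t_new
  if (converged) break
}

# The sign of a component is arbitrary, so compare absolute values
max(abs(abs(p1) - abs(p1_eigen)))       # close to 0
max(abs(abs(t1) - abs(t1_eigen)))       # close to 0
eig$values / (n - 1)                    # variances of the four components
```

Further NIPALS components would be obtained by subtracting t1 %*% t(p1) from X and repeating the loop on the deflated matrix.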

2.2 Example: Fisher’s Iris Data

Below there will be an exercise based on these data with some questions that PCA can be helpful in answering. Here we exemplify a number of visualizations that one could do for such data, including PCA-based ones.

The Fisher Iris data set is a classic, cf.:


• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.

• Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5.

There are 150 objects: 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measured with a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW), all in all only four variables.

The original hypothesis was that I. versicolor was a hybrid of the two other species, i.e. I. setosa x virginica. I. setosa is diploid; I. virginica is tetraploid; and I. versicolor is hexaploid.

2.2.1 Data import

The iris data can already be found within R, so no import is needed:

# Loading the packages related to the Wehrens book

# (The first time, you need to install the packages)

library(ChemometricsWithRData)

library(ChemometricsWithR)

data(iris)

Or read the Iris CSV data, which is a copy of the file uploaded on CampusNet. Note that the Iris data given on CampusNet is slightly different from the iris data available in R. First save the data set on your computer and set the relevant working directory in R, e.g. by clicking 'Session' and choosing 'Set working directory', or run the following command with the correct folder path:

setwd("C:/myfolderpath")

And then import the data into R as follows:

JCFiris <- read.table("Fisher_JCF.csv", header = TRUE, sep = ";", dec = ",")

Note that the Iris data given by JCF is slightly different from the IRIS data available in R:


summary(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width

Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1

1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3

Median :5.80 Median :3.00 Median :4.35 Median :1.3

Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2

3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8

Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5

Species

setosa :50

versicolor:50

virginica :50

summary(JCFiris)

X PW PL SW

setosa :50 Min. : 1.0 Min. :10.0 Min. :20.0

versicolor:50 1st Qu.: 3.0 1st Qu.:16.0 1st Qu.:28.0

virginica :50 Median :13.0 Median :44.0 Median :30.0

Mean :11.9 Mean :37.8 Mean :30.6

3rd Qu.:18.0 3rd Qu.:51.0 3rd Qu.:33.0

Max. :25.0 Max. :69.0 Max. :44.0

SL

Min. : 43.0

1st Qu.: 51.0

Median : 58.0

Mean : 62.6

3rd Qu.: 64.0

Max. :699.0

Note the differences: the names, order and scales. Also, an outlier in the JCF version has been changed in the R version. Look at the first 6 observations:

head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species


1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

head(JCFiris)

X PW PL SW SL

1 setosa 2 14 33 50

2 virginica 24 56 31 67

3 virginica 23 51 31 69

4 setosa 2 10 36 46

5 virginica 20 52 30 65

6 virginica 19 51 27 58

The dimensions are the same:

dim(iris)

[1] 150 5

dim(JCFiris)

[1] 150 5
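The differences in names, column order and scale noted above can be removed with a few lines of base R. This is only a sketch: the small data frame jcf_head is typed in by hand from the head(JCFiris) output shown above, and the factor of 10 between the two versions is an assumption read off the two summaries (for example, minimum SL 43.0 versus minimum Sepal.Length 4.30).

```r
# Sketch: bring a JCF-style data frame into the layout of R's iris data.
# jcf_head is a hand-typed copy of the six rows printed by head(JCFiris).
jcf_head <- data.frame(
  X  = c("setosa", "virginica", "virginica", "setosa", "virginica", "virginica"),
  PW = c(2, 24, 23, 2, 20, 19),
  PL = c(14, 56, 51, 10, 52, 51),
  SW = c(33, 31, 31, 36, 30, 27),
  SL = c(50, 67, 69, 46, 65, 58)
)

# Assumption: the JCF values are ten times the values used in R's iris
jcf_as_iris <- data.frame(
  Sepal.Length = jcf_head$SL / 10,
  Sepal.Width  = jcf_head$SW / 10,
  Petal.Length = jcf_head$PL / 10,
  Petal.Width  = jcf_head$PW / 10,
  Species      = factor(jcf_head$X, levels = levels(iris$Species))
)
jcf_as_iris
```

The same renaming and rescaling would apply to the full JCFiris data frame once it has been read from file; the outlier (maximum SL of 699.0) would still need separate attention.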

2.2.2 Basic explorative analysis

First we do some classic (univariate) explorative analysis:

# 4 boxplots with color:

par(mar=c(4,2,3,2),mfrow=c(2,2))

for (i in 1:4) boxplot(iris[,i] ~ iris[,5],

col = 1:3, main = names(iris)[i])

[Figure: boxplots of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width for setosa, versicolor and virginica]

The par(mar=c(4,2,3,2)) command controls the four margins of each individual plot in the order: bottom, left, top, right. This is helpful for making nice multi-plot pages.

# Pairwise scatters:

pairs(iris,col = iris$Species)

[Figure: pairwise scatter plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, colored by species]

Let us, for the record, have a look at the covariance matrix:

cov(iris[,1:4])

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         0.69       -0.04         1.27        0.52
Sepal.Width         -0.04        0.19        -0.33       -0.12
Petal.Length         1.27       -0.33         3.12        1.30
Petal.Width          0.52       -0.12         1.30        0.58

And similarly the correlation matrix:

cor(iris[,1:4])

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length         1.00       -0.12         0.87        0.82
Sepal.Width         -0.12        1.00        -0.43       -0.37
Petal.Length         0.87       -0.43         1.00        0.96
Petal.Width          0.82       -0.37         0.96        1.00
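The two matrices are linked: the correlation matrix is the covariance matrix after each variable has been divided by its standard deviation. A minimal sketch of this relation, using only base R (the names S, R, d and R_from_S are illustrative):

```r
# Sketch: the correlation matrix is the covariance matrix of standardized data
S <- cov(iris[, 1:4])
R <- cor(iris[, 1:4])

# Rescale S by the standard deviations: R = D^(-1/2) S D^(-1/2)
d <- sqrt(diag(S))
R_from_S <- S / outer(d, d)

# Each comparison checks agreement up to numerical tolerance
all.equal(R_from_S, R, check.attributes = FALSE)
all.equal(cov(scale(iris[, 1:4])), R, check.attributes = FALSE)
all.equal(cov2cor(S), R, check.attributes = FALSE)
```

This is why a PCA on standardized data is often called a PCA on correlations, while a PCA on centered data only is a PCA on covariances.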

2.2.3 PCA of Iris data

First we do a basic PCA on covariances (WITHOUT standardization, ONLY with centering), here using the PCA function of the ChemometricsWithR package:

irisPC_without=PCA(scale(iris[,1:4], scale = FALSE))

Note that the scale-function is used here to just center the four variables.

# A good selection of 4 core plots:

par(mar=c(4,2,3,2),mfrow=c(2,2))

scoreplot(irisPC_without, col = iris$Species, main = "Scores")

loadingplot(irisPC_without, show.names = TRUE, main = "Loadings")

biplot(irisPC_without, score.col = iris$Species, main = "biplot")

screeplot(irisPC_without, type = "percentage", main = "Explained variance")

[Figure: scores, loadings, biplot and explained variance for the PCA on covariances; PC 1 explains 92.5% and PC 2 explains 5.3% of the variance]

And now the PCA on correlations (WITH Standardization - AND with centering):

irisPC <- PCA(scale(iris[,1:4]))

Note that the scale function is now used to both center and standardize the four variables, which is the default behavior of this function.
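The effect of standardization on the explained variances can also be inspected numerically. The following sketch uses prcomp from base R instead of the PCA function of ChemometricsWithR, which is a substitution made here so that the example runs without extra packages; the helper explained() is hypothetical.

```r
# Sketch: explained variance for PCA on covariances versus correlations
X <- iris[, 1:4]
pc_cov <- prcomp(X, center = TRUE, scale. = FALSE)   # centering only
pc_cor <- prcomp(X, center = TRUE, scale. = TRUE)    # centering and scaling

# Percentage of the total variance carried by each component
explained <- function(pc) round(100 * pc$sdev^2 / sum(pc$sdev^2), 1)

explained(pc_cov)   # PC 1 and PC 2: 92.5 and 5.3 percent
explained(pc_cor)   # PC 1 to PC 4: 73.0, 22.9, 3.7 and 0.5 percent
```

These percentages agree with the axis labels of the score and loading plots in this note. Without standardization, Petal.Length has by far the largest variance and therefore dominates PC 1.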

par(mar=c(4,2,3,2),mfrow=c(2,2))

scoreplot(irisPC, col = iris$Species, main = "Scores")


loadingplot(irisPC, show.names = TRUE, main = "Loadings")

biplot(irisPC, score.col = iris$Species, main = "biplot")

screeplot(irisPC, type = "percentage", main = "Explained variance")

[Figure: scores, loadings, biplot and explained variance for the PCA on correlations; PC 1 explains 73.0% and PC 2 explains 22.9% of the variance]

There can be other versions of the variance plot, e.g.:

par(mfrow=c(1,2))

plot(1:length(irisPC$var), irisPC$var, cex = 2,

ylab = "variance explained",xlab = "n PC")


lines(1:length(irisPC$var), irisPC$var)

plot(1:length(irisPC$var), irisPC$var/sum(irisPC$var), cex = 2,

ylab = "(explained variance)/(total variance)",xlab = "n PC")

lines(1:length(irisPC$var), irisPC$var/sum(irisPC$var))
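Yet another version is the cumulative proportion of variance, which can help when deciding on the number of components. The sketch below uses prcomp from base R rather than the irisPC object; the 95% threshold is an arbitrary illustrative choice, not a recommendation from this note.

```r
# Sketch: cumulative proportion of explained variance (base R only)
pc <- prcomp(iris[, 1:4], scale. = TRUE)
prop <- pc$sdev^2 / sum(pc$sdev^2)     # proportion per component
cum_prop <- cumsum(prop)               # cumulative proportion

plot(seq_along(cum_prop), cum_prop, type = "b", ylim = c(0, 1),
     xlab = "n PC", ylab = "cumulative proportion of variance")
abline(h = 0.95, lty = 2)              # hypothetical 95% threshold

# Smallest number of components reaching the threshold
which(cum_prop >= 0.95)[1]
```

For the standardized iris data the first two components together carry roughly 96% of the variance, so the threshold is reached with two components.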

[Figure: variance explained (left) and explained variance as a proportion of total variance (right), plotted against the number of PCs]

It can be useful to plot more components than just the first two:

# Scores:

pairs(scores(irisPC), col = iris$Species)

[Figure: pairwise score plots for PC 1 to PC 4, colored by species]

# Loadings:

par(mfrow = c(4,4), mar = c(4,4,.1,.1))

for (i in 1:4) for (j in 1:4) loadingplot(irisPC,

show.names = TRUE,pc=c(i,j), cex.lab=0.7)

[Figure: 4 x 4 grid of loading plots for all pairs of PC 1 (73.0%), PC 2 (22.9%), PC 3 (3.7%) and PC 4 (0.5%), with the four variable names shown]

A much nicer biplot can be created with the ggbiplot package (now using the prcomp function to do the PCA):

ir.pca <- prcomp(iris[,1:4],
                 center = TRUE,
                 scale. = TRUE)

library(devtools)

# First time install: install_github("vqv/ggbiplot")

library(ggbiplot)

g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1,
              groups = iris[,5], ellipse = TRUE,
              circle = FALSE)

print(g)

[Figure: ggbiplot of PC1 (73.0% explained var.) versus PC2 (22.9% explained var.) with variable arrows and ellipses for the groups setosa, versicolor and virginica]

Generally about interpreting PCA plots:

• Look at variances (scree): hope for few (2) and look for the "bend"

• Look at scores and loadings (e.g. biplot)

– Scores: OBSERVATION mapping

∗ preserves inter-observation distances (as well as possible)

– Loadings: VARIABLE mapping (correlation structure)

∗ Variables in the SAME DIRECTION from (0,0) AND far away from (0,0) are highly correlated

– Loadings tell us on which variables the observations differ

– An observation to the right has high values on the variables with (large) loadings to the right

– An observation to the left has low values on the variables with (large) loadings to the right

• Look at residuals (orthogonal distances) and leverages (score distances) (outliers etc.)

Finally, let us show some of the diagnostics (residuals) plotting. For this we will use the chemometrics package (and now the princomp function for the PCA):

library(chemometrics)

irisPCA <- princomp(iris[,1:4], cor = TRUE)

# The score distances res$SDist express the leverage values

# The orthogonal distances express the residuals

## Plots vs object number :

res <- pcaDiagplot(iris[,1:4], irisPCA, a = 2)
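To see what the two diagnostics measure, they can be computed by hand. The following is a sketch in base R that follows the usual definitions of score distance and orthogonal distance; it uses prcomp on standardized data, so the values may differ slightly from those returned by pcaDiagplot (princomp divides by n rather than n - 1). The object names are illustrative.

```r
# Sketch: score distances (SD) and orthogonal distances (OD) by hand
X <- scale(iris[, 1:4])                  # centered and scaled data
pc <- prcomp(X, center = FALSE)
a <- 2                                   # number of components kept

T_a <- pc$x[, 1:a, drop = FALSE]         # scores
P_a <- pc$rotation[, 1:a, drop = FALSE]  # loadings

# SD: distance within the model plane, each score scaled by its
# standard deviation (a leverage-type measure)
SD <- sqrt(rowSums(sweep(T_a^2, 2, pc$sdev[1:a]^2, "/")))

# OD: length of the residual vector, i.e. the distance to the model plane
OD <- sqrt(rowSums((X - T_a %*% t(P_a))^2))

plot(SD, OD, type = "n",
     xlab = "Score distance SD", ylab = "Orthogonal distance OD")
text(SD, OD, labels = seq_len(nrow(X)))
```

Observations with a large SD are extreme within the model, while observations with a large OD are poorly described by the first a components.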

[Figure: score distance SD (left) and orthogonal distance OD (right) plotted against object number]

## Plot of the two against each other:

par(mfrow=c(1,2))

plot(res$SDist, res$ODist, type = "n")

text(res$SDist, res$ODist, labels = row.names(iris))

## Explained variance for each variable

pcaVarexpl(iris[,1:4],a=2)

[Figure: orthogonal distance (res$ODist) versus score distance (res$SDist) with observation numbers as labels (left); explained variance for each variable (right)]

# Influence plot: residuals versus leverage

# for different number of components:

par(mfrow=c(2,2))

for (i in 1:4) {res=pcaDiagplot(iris[,1:4],a=i,irisPCA,plot=FALSE)

plot(res$SDist,res$ODist,type="n")

text(res$SDist,res$ODist,labels=row.names(iris))

}

[Figure: influence plots of orthogonal distance (res$ODist) versus score distance (res$SDist) for 1, 2, 3 and 4 components, with observation numbers as labels; with 4 components the orthogonal distances are of the order 1e-15]

Finally, let us indicate how one could do some 're-sampling' (similar to "jackknifing"): leaving out a certain number of the observations and plotting the loadings and/or scores for each subset of the data. First the loadings:

# Random samples of a certain proportion of the

# original number of observations are left out

par(mar = c(1,1,1,1), mfrow = c(3,3))

n=length(iris[,1])

leave_out_size=0.50

for (k in 1:9) {
  # PCA on a random subsample; a separate name keeps the full-data irisPC intact
  irisPC_sub <- PCA(scale(iris[sample(1:n, round(n*(1-leave_out_size))), 1:4]))
  loadingplot(irisPC_sub, show.names = TRUE, main = "Loadings")
}

[Figure: 3 x 3 grid of loading plots (PC 1 versus PC 2) for nine random subsamples; PC 1 explains roughly 72 to 74% and PC 2 roughly 21 to 24% of the variance, and the sign of the PC 2 loadings differs between panels]

Then the scores:

par(mar = c(1,1,1,1), mfrow = c(3,3))

for (k in 1:9) {
  subsample <- sample(1:n, round(n*(1-leave_out_size)))
  irisPC_sub <- PCA(scale(iris[subsample, 1:4]))
  scoreplot(irisPC_sub, col = iris$Species[subsample], main = "Scores")
}
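The visual comparison of subsamples can be complemented by a numerical summary of how much the loadings vary. The sketch below is one possible way to do this, using prcomp from base R; the seed, the number of subsamples B and the restriction to PC 1 are arbitrary choices made for illustration. Because the sign of a principal component is arbitrary, each subsample loading vector is flipped to point in the same direction as the full-data loading before summarizing.

```r
# Sketch: variability of the PC 1 loadings across random half-samples
set.seed(1)                                   # hypothetical seed, for reproducibility
n <- nrow(iris)
B <- 200                                      # number of subsamples (arbitrary)
ref <- prcomp(iris[, 1:4], scale. = TRUE)$rotation[, 1]

load1 <- t(replicate(B, {
  idx <- sample(n, round(n / 2))              # leave out half of the observations
  p <- prcomp(iris[idx, 1:4], scale. = TRUE)$rotation[, 1]
  if (sum(p * ref) < 0) -p else p             # flip sign to match the reference
}))

# Full-data loading, and mean and standard deviation over the subsamples
round(rbind(full = ref,
            mean = colMeans(load1),
            sd   = apply(load1, 2, sd)), 3)
```

Small standard deviations relative to the size of the loadings indicate that the first component is stable under re-sampling; the same idea extends to further components and to the scores.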

[Figure: 3 x 3 grid of score plots (PC 1 versus PC 2) for nine random subsamples, colored by species; PC 1 explains roughly 69 to 75% and PC 2 roughly 20 to 26% of the variance]

The choice of showing nine subsamples is arbitrary. Other plots of this re-sampling type could be devised.
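One such alternative, sketched here under stated assumptions, collects the PC 1 loadings over many random subsamples and summarises them in a single boxplot. The sketch uses base R's prcomp() rather than the PCA() function used above, and the values of n_rep and leave_out_size are illustrative choices, not taken from the text. Because the sign of a principal component is arbitrary, each loading vector is flipped to agree with the full-data solution before it is stored.

```r
set.seed(1)                      # makes the random subsamples reproducible
X <- iris[, 1:4]
n <- nrow(X)
leave_out_size <- 0.5            # assumed fraction left out in each subsample
n_rep <- 200                     # assumed number of subsamples

# Reference PC 1 loadings from the full, autoscaled data
ref <- prcomp(X, center = TRUE, scale. = TRUE)$rotation[, 1]

pc1_loadings <- matrix(NA, nrow = n_rep, ncol = ncol(X),
                       dimnames = list(NULL, colnames(X)))
for (k in 1:n_rep) {
  subsample <- sample(1:n, round(n * (1 - leave_out_size)))
  p <- prcomp(X[subsample, ], center = TRUE, scale. = TRUE)$rotation[, 1]
  # The sign of a PC is arbitrary: align it with the reference solution
  if (sum(p * ref) < 0) p <- -p
  pc1_loadings[k, ] <- p
}

# One box per variable: narrow boxes indicate stable loadings
boxplot(pc1_loadings, ylab = "PC 1 loading", las = 2)
abline(h = 0, lty = 2)
```

A variable whose box straddles zero would have a loading that is not stable under re-sampling; the same loop could store the PC 2 loadings or the explained variances instead.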

2.3 Spectral data example: yarn data

## Spectral data
data(yarn, package = "pls") # The yarn data ship with the pls package
# Try: ?yarn
dim(yarn$NIR)
## [1] 28 268

par(mfrow = c(2, 2), mar = c(4, 4, .2, .2))
# Plot the first 21 individual NIR spectra
plot(yarn$NIR[1, ], type = "n", ylim = range(yarn$NIR))
for (i in 1:21) lines(yarn$NIR[i, ], col = i)
# Plot the first 21 individual NIR spectra - centered (column-wise)
NIR_c <- scale(yarn$NIR, scale = FALSE)
plot(NIR_c[1, ], type = "n", ylim = range(NIR_c))
for (i in 1:21) lines(NIR_c[i, ], col = i)
# Plot the first 21 individual NIR spectra - centered and scaled
NIR_cs <- scale(yarn$NIR)
plot(NIR_cs[1, ], type = "n", ylim = range(NIR_cs))
for (i in 1:21) lines(NIR_cs[i, ], col = i)
# Plot the principal variances
yarnPC <- PCA(scale(yarn$NIR))
plot(1:length(yarnPC$var), yarnPC$var, cex = 2)
lines(1:length(yarnPC$var), yarnPC$var)

[Figure: 2 x 2 panels. Three panels show the NIR spectra against wavelength index (0-250): raw, centered, and centered and scaled. The fourth panel shows the principal variances yarnPC$var against component number.]

# Plot of y:

plot(yarn$density,type="n")

lines(yarn$density)

[Figure: yarn$density plotted against sample index.]

2.4 Exercises

Exercise 1 Fisher’s Iris data

First examine the raw data and check whether there are obvious mistakes. After that one could use other Unscrambler features to examine the statistical properties of the objects and variables, but in this case we go directly to PCA, as this gives a very fine overview of the data and will often show outliers immediately. Perform the PCA with leverage correction and with centering. Examine the four standard plots (score plot, loading plot, influence plot and explained variance plot).
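As a possible starting point in R, the four standard plots can be produced with base functions only. This is a sketch, not the Unscrambler procedure: prcomp() stands in for the PCA software, the number of retained components A = 2 is an assumed choice, and the influence plot is drawn here as leverage against residual sum of squares per object. No leverage-correction (validation) step is included.

```r
X  <- as.matrix(iris[, 1:4])
pc <- prcomp(X, center = TRUE, scale. = FALSE)   # centering only
expl <- 100 * pc$sdev^2 / sum(pc$sdev^2)         # explained variance in %

A <- 2                                           # assumed number of PCs retained
scores_A <- pc$x[, 1:A, drop = FALSE]
# Leverage of each object in the A-component model
leverage <- 1 / nrow(X) +
  rowSums(sweep(scores_A^2, 2, colSums(scores_A^2), "/"))
# Residual sum of squares per object after A components
Xc <- scale(X, center = TRUE, scale = FALSE)
E  <- Xc - scores_A %*% t(pc$rotation[, 1:A, drop = FALSE])
Q  <- rowSums(E^2)

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(pc$x[, 1], pc$x[, 2], col = as.integer(iris$Species),
     xlab = "PC 1", ylab = "PC 2", main = "Scores")
plot(pc$rotation[, 1], pc$rotation[, 2], type = "n",
     xlab = "PC 1", ylab = "PC 2", main = "Loadings")
text(pc$rotation[, 1], pc$rotation[, 2], labels = colnames(X))
plot(leverage, Q, xlab = "Leverage", ylab = "Residual SS", main = "Influence")
plot(cumsum(expl), type = "b", xlab = "Number of PCs",
     ylab = "Cumulative explained variance (%)", main = "Explained variance")
```

Objects far to the right in the influence plot pull strongly on the model, and objects high up are poorly described by it; an object that is extreme in both directions deserves a closer look in the raw data.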

a) How many principal components would you need and what does the first PC (PC1) describe?

b) What percentage of the variation is described by the first two PCs?

c) Can you find an outlier? If so, do you have an idea why this outlier came about (loadings plot or scores plot)? In R: Do you see a problem in the influence plot? If there is an outlier, in which other plot can you see the problem? If you see severe outliers, remove them from the data and run PCA again (and answer a) and b) again).

d) Does standardization (autoscaling) give a better model? (Answer a) and b) again.)

e) How many PCs are needed to explain 70%, 75% and 90% of the variation in the data?

f) What is the maximum number of PCs you can get from this dataset?

g) Compare the score and the loading plot, and make a biplot. Do any of the variables "tell the same story"?

h) Are any variables more discriminative than the others? Are any variables dispensable?

i) Can you see the presupposed classes? Any class overlap?

j) Does the original hypothesis seem to be OK?
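For questions such as d), e) and g), a small helper can make the comparison between a centered and an autoscaled model concrete. The sketch below is one possible way to set this up with base R's prcomp() and biplot(); the helper names cum_expl and n_pcs_needed are hypothetical and introduced only for illustration.

```r
X <- iris[, 1:4]
pc_centered   <- prcomp(X, center = TRUE, scale. = FALSE)
pc_autoscaled <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative explained variance in percent
cum_expl <- function(pc) 100 * cumsum(pc$sdev^2) / sum(pc$sdev^2)

# Smallest number of PCs whose cumulative explained variance reaches each target
n_pcs_needed <- function(pc, targets = c(70, 75, 90)) {
  sapply(targets, function(tg) which(cum_expl(pc) >= tg)[1])
}

# Side-by-side comparison of the two pre-processing choices
round(rbind(centered   = cum_expl(pc_centered),
            autoscaled = cum_expl(pc_autoscaled)), 1)
n_pcs_needed(pc_centered)
n_pcs_needed(pc_autoscaled)

# Scores and loadings in one display
biplot(pc_autoscaled, cex = 0.6)
```

In the biplot, variables whose arrows point in nearly the same direction carry similar information in the first two PCs.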

Exercise 2 Wine Data (To be presented by Team 1 next time)

The second dataset is called VIN2:

• Forina, M., Armanino, C., Castino, M. and Ubigli, M. (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25: 189-201.

• Forina, M., Lanteri, S., Armanino, C., Casolino, C. and Casale, M. (2010). V-PARVUS. An extendable package of programs for data exploration, classification, and correlation. (www.parvus.unige.it)

The dataset VIN2.csv is an Excel CSV file. In this dataset there are 178 objects (Italian wines): the first 59 are Barolo wines (B1-B59), the next 71 are Grignolino wines (G60-G130) and the last 48 are Barbera wines (S131-S178). These wines have been characterized by 13 variables (chemical and physical measurements):

1. Alcohol (in %)

2. Malic acid

3. Ash

4. Alkalinity of Ash

5. Magnesium

6. Total phenols

7. Flavonoids

8. Nonflavonoid phenols

9. Proanthocyanins

10. Colour intensity

11. Colour hue

12. OD280 / OD315 of diluted wines

13. Proline (amino acid)

The wine data can already be found within R, so no import is needed:

# Wines data:
# From the JCF uploaded file (semicolon-separated, decimal comma).
# Also slightly different from the version in the package.
# stringsAsFactors = TRUE reproduces the factor summaries shown below in R >= 4.0
JCFwines <- read.table("VIN2.csv", header = TRUE, sep = ";", dec = ",",
                       stringsAsFactors = TRUE)
# The wines data from the package:
# The wine class information is here stored in the wine.classes object
data(wines, package = "ChemometricsWithRData")
head(wines)

     alcohol malic acid  ash ash alkalinity magnesium tot. phenols
[1,]   13.20       1.78 2.14           11.2       100         2.65
[2,]   13.16       2.36 2.67           18.6       101         2.80
[3,]   14.37       1.95 2.50           16.8       113         3.85
[4,]   13.24       2.59 2.87           21.0       118         2.80
[5,]   14.20       1.76 2.45           15.2       112         3.27
[6,]   14.39       1.87 2.45           14.6        96         2.50
     flavonoids non-flav. phenols proanth col. int. col. hue OD ratio
[1,]       2.76              0.26    1.28      4.38     1.05     3.40
[2,]       3.24              0.30    2.81      5.68     1.03     3.17
[3,]       3.49              0.24    2.18      7.80     0.86     3.45
[4,]       2.69              0.39    1.82      4.32     1.04     2.93
[5,]       3.39              0.34    1.97      6.75     1.05     2.85
[6,]       2.52              0.30    1.98      5.25     1.02     3.58
     proline
[1,]    1050
[2,]    1185
[3,]    1480
[4,]     735
[5,]    1450
[6,]    1290

head(JCFwines)

   X   Wine    F1   F2   F3   F4  F5   F6   F7   F8   F9  F10  F11  F12
1 S1 Barolo 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92
2 S2 Barolo 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40
3 S3 Barolo 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17
4 S4 Barolo 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45
5 S5 Barolo 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93
6 S6 Barolo 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85
   F13
1 1070
2 1050
3 1190
4 1480
5  735
6 1450

summary(wines)

    alcohol        malic acid        ash         ash alkalinity
 Min.   :11.0   Min.   :0.74   Min.   :1.36   Min.   :10.6
 1st Qu.:12.4   1st Qu.:1.60   1st Qu.:2.21   1st Qu.:17.2
 Median :13.1   Median :1.87   Median :2.36   Median :19.5
 Mean   :13.0   Mean   :2.34   Mean   :2.37   Mean   :19.5
 3rd Qu.:13.7   3rd Qu.:3.10   3rd Qu.:2.56   3rd Qu.:21.5
 Max.   :14.8   Max.   :5.80   Max.   :3.23   Max.   :30.0
   magnesium      tot. phenols    flavonoids    non-flav. phenols
 Min.   : 70.0   Min.   :0.98   Min.   :0.34   Min.   :0.130
 1st Qu.: 88.0   1st Qu.:1.74   1st Qu.:1.20   1st Qu.:0.270
 Median : 98.0   Median :2.35   Median :2.13   Median :0.340
 Mean   : 99.6   Mean   :2.29   Mean   :2.02   Mean   :0.362
 3rd Qu.:107.0   3rd Qu.:2.80   3rd Qu.:2.86   3rd Qu.:0.440
 Max.   :162.0   Max.   :3.88   Max.   :5.08   Max.   :0.660
    proanth       col. int.        col. hue        OD ratio
 Min.   :0.41   Min.   : 1.28   Min.   :0.480   Min.   :1.27
 1st Qu.:1.25   1st Qu.: 3.21   1st Qu.:0.780   1st Qu.:1.93
 Median :1.55   Median : 4.68   Median :0.960   Median :2.78
 Mean   :1.59   Mean   : 5.05   Mean   :0.957   Mean   :2.60
 3rd Qu.:1.95   3rd Qu.: 6.20   3rd Qu.:1.120   3rd Qu.:3.17
 Max.   :3.58   Max.   :13.00   Max.   :1.710   Max.   :4.00
    proline
 Min.   : 278
 1st Qu.: 500
 Median : 672
 Mean   : 745
 3rd Qu.: 985
 Max.   :1680

summary(JCFwines)

       X            Wine          F1              F2             F3
 S1     :  1   Barbera:48   Min.   : 3.67   Min.   :0.74   Min.   :1.36
 S10    :  1   Barolo :59   1st Qu.:12.35   1st Qu.:1.60   1st Qu.:2.21
 S100   :  1   Grigno :71   Median :13.05   Median :1.86   Median :2.36
 S101   :  1                Mean   :12.94   Mean   :2.34   Mean   :2.37
 S102   :  1                3rd Qu.:13.67   3rd Qu.:3.08   3rd Qu.:2.56
 S103   :  1                Max.   :14.83   Max.   :5.80   Max.   :3.23
 (Other):172
       F4             F5              F6             F7
 Min.   :10.6   Min.   : 70.0   Min.   :0.98   Min.   :0.34
 1st Qu.:17.2   1st Qu.: 88.0   1st Qu.:1.74   1st Qu.:1.21
 Median :19.5   Median : 98.0   Median :2.35   Median :2.13
 Mean   :19.5   Mean   : 99.7   Mean   :2.30   Mean   :2.03
 3rd Qu.:21.5   3rd Qu.:107.0   3rd Qu.:2.80   3rd Qu.:2.88
 Max.   :30.0   Max.   :162.0   Max.   :3.88   Max.   :5.08
       F8              F9             F10             F11
 Min.   :0.130   Min.   :0.41   Min.   : 1.28   Min.   :0.480
 1st Qu.:0.270   1st Qu.:1.25   1st Qu.: 3.22   1st Qu.:0.782
 Median :0.340   Median :1.55   Median : 4.69   Median :0.965
 Mean   :0.362   Mean   :1.59   Mean   : 5.06   Mean   :0.957
 3rd Qu.:0.438   3rd Qu.:1.95   3rd Qu.: 6.20   3rd Qu.:1.120
 Max.   :0.660   Max.   :3.58   Max.   :13.00   Max.   :1.710
      F12            F13
 Min.   :0.56   Min.   : 278
 1st Qu.:1.92   1st Qu.: 500
 Median :2.78   Median : 674
 Mean   :2.59   Mean   : 753
 3rd Qu.:3.17   3rd Qu.: 989
 Max.   :4.00   Max.   :1940

a) Examine the raw data. Are there any severe outliers you can detect? What do you think happened with the outlier, if any?

b) Correct wrong data, if any (in the Excel file), and use PCA again. Do the score and loading plots look significantly different now?

c) Try PCA without standardization: Which variables are important here and why?

d) Try PCA with standardization. Which variables are important here, and would you recommend removing any of them from the data set? Which variables are especially important for the Barbera wines?

e) Suppose that alcohol % and proanthocyanins were especially healthy; which wine would you recommend?

f) Use some re-sampling/jack-knifing methods to test for significance of the variables. Are all the variables stable?
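For the raw-data check in a), one simple screening device is a robust z-score per variable. The sketch below is only an illustration: the data frame toy is made up (it is not the wine data), the helper robust_z is hypothetical, and the cut-off of 5 is an assumed choice rather than a fixed rule.

```r
# Hypothetical toy data: one value with a misplaced decimal point
toy <- data.frame(alcohol = c(13.2, 13.1, 14.4, 13.2, 1.42, 14.3),
                  ash     = c(2.14, 2.67, 2.50, 2.87, 2.45, 2.45))

# Robust z-score: distance from the median in units of the MAD
robust_z <- function(x) (x - median(x)) / mad(x)

z <- sapply(toy, robust_z)         # matrix with one column per variable
threshold <- 5                     # assumed cut-off, not a fixed rule

# Row and column positions of the suspicious values
flagged <- which(abs(z) > threshold, arr.ind = TRUE)
flagged
```

The same function could be applied to the numeric columns of JCFwines; a flagged value still has to be compared with the other measurements on that wine before it is corrected or removed.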