pca examples & applications · 2018. 2. 9. · •least-squares regression analysis (anova),...
PCA – Examples & Applications
➢ Objectives:
Showcase PCA analysis – in PC-ORD and the literature
Principal Components (PCA) – PC-ORD
➢ Results: Randomization tests
Principal Components (PCA) – PC-ORD
➢ Results: Species Loadings onto the PC Axes
➢ Use the scaled eigenvectors
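A minimal sketch of what "scaled eigenvectors" means, assuming a PCA on the correlation matrix: each eigenvector is multiplied by the square root of its eigenvalue, so each loading equals the correlation between a variable and an axis. The data are randomly generated for illustration, and the function name `pca_correlation` is hypothetical (this is not PC-ORD output).

```python
import numpy as np

def pca_correlation(data):
    """PCA on the correlation matrix (rows = samples, columns = variables)."""
    # Standardize each variable to mean 0 and SD 1
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort: largest axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scaled eigenvectors (loadings): eigenvector * sqrt(eigenvalue)
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    scores = z @ eigvecs                         # sample scores on each axis
    return eigvals, eigvecs, loadings, scores

# Hypothetical data: 40 samples, 4 variables, two of them correlated
rng = np.random.default_rng(0)
data = rng.normal(size=(40, 4))
data[:, 1] += data[:, 0]

eigvals, eigvecs, loadings, scores = pca_correlation(data)
print(np.round(loadings[:, :2], 3))
```

With a covariance matrix instead of a correlation matrix, the scaled eigenvectors would no longer be bounded between -1 and +1.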
Principal Components (PCA) – PC-ORD
➢ Results: Correlation with Axes
Principal Components (PCA) – Example
➢ Results: Graphs
Samples: points Species: vectors
Principal Components (PCA) – PC-ORD
➢ Results: Graphs
Samples: points
Species: vectors
PC-ORD recommends displaying species as vectors / samples as points
➢ Rotation by NEDO
Stretch plot along direction of most variation for species
NEDO Axes Correlations: Axis 1: +0.51, Axis 2: -0.74
➢ Rotation: Highlights certain patterns. Report in results
Principal Components (PCA) – PC-ORD
➢ Percent of pattern explained in original distance matrix
➢ Orthogonality of PCA axes
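Both diagnostics can be illustrated with a short sketch. It assumes that "percent of pattern explained" is measured as the squared correlation between pairwise Euclidean distances in the original (standardized) space and in the ordination space, and that orthogonality is checked as the correlation between axis scores. The data and the helper names `pairwise_distances` and `distance_r2` are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all pairs of rows (upper triangle only)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist[np.triu_indices(len(points), k=1)]

def distance_r2(original, scores, n_axes):
    """r^2 between original distances and distances on the first n_axes."""
    d_orig = pairwise_distances(original)
    d_ord = pairwise_distances(scores[:, :n_axes])
    return np.corrcoef(d_orig, d_ord)[0, 1] ** 2

# Hypothetical data: 30 samples, 5 variables
rng = np.random.default_rng(1)
data = rng.normal(size=(30, 5))
data[:, 1] += 2 * data[:, 0]

z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(data, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]   # largest axis first
scores = z @ eigvecs

# Cumulative r^2 for 1..5 axes; all 5 axes reproduce the distances exactly
r2_by_axes = [distance_r2(z, scores, k) for k in range(1, 6)]
print(np.round(r2_by_axes, 3))

# Orthogonality: correlations between different axes should be ~0
score_corr = np.corrcoef(scores, rowvar=False)
print(np.round(score_corr, 3))
```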
Principal Components (PCA) – PC-ORD
Principal Components (PCA) – Reporting
➢ What type of cross-correlation matrix did you use?
Use covariance matrix. Use Euclidean distance
➢ If used with community data, justify using this linear
model for species data
Were assumptions of linearity / normality met?
➢ How many axes were interpreted, and what proportion
of variance was explained by these axes?
Describe the axes – and the individual / cumulative variance
➢ Principal eigenvectors - Test of significance?
Not necessary, but an option using randomization tests
➢ Rotation of the solution? Use of interpretation aids?
Explain overlays and correlations of variables with axes
PCA Example – Upwell
Where do we start? Data Exploration + Summarization
What do we look for?
Mean, S.D., Skewness, Kurtosis (Transformations?)
Value Ranges, Outliers (Typos?)
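The summary statistics listed above can be computed with a few lines of standard-library Python. This is a sketch under stated assumptions: skewness and kurtosis use the simple moment-based (population) formulas, kurtosis is reported as excess kurtosis (0 for a normal distribution), and the function name `describe` and the input values are hypothetical.

```python
import math

def describe(values):
    """Mean, SD, skewness, excess kurtosis, and range for one variable."""
    n = len(values)
    mean = sum(values) / n
    # Central moments (divided by n)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": math.sqrt(m2 * n / (n - 1)),   # sample SD (n - 1 denominator)
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,      # excess kurtosis
        "min": min(values),                  # range check helps spot typos
        "max": max(values),
    }

summary = describe([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```

Strong skewness or a suspicious minimum / maximum would be a cue to consider a transformation or to check the raw data for entry errors.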
PCA Example – Upwell
Principal Component - Results
PCA Example – Upwell
1st Stopping Rule (Eigenvalue > Broken-Stick Eigenvalue)
0 Axes meet this criterion
How Many Axes Give us the Right Answer?
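The first stopping rule can be sketched as follows, assuming a correlation-matrix PCA with p variables, where the broken-stick eigenvalue for axis k is the sum of 1/i for i = k..p. The eigenvalues below are made up for illustration (they are not the Upwell results), and the sketch assumes axes are counted sequentially until the first one fails.

```python
def broken_stick(p):
    """Broken-stick eigenvalues for p variables (they sum to p)."""
    return [sum(1.0 / i for i in range(k, p + 1)) for k in range(1, p + 1)]

def axes_passing_broken_stick(eigenvalues):
    """Count leading axes whose eigenvalue exceeds the broken-stick value."""
    count = 0
    for observed, expected in zip(eigenvalues, broken_stick(len(eigenvalues))):
        if observed <= expected:
            break          # stop at the first axis that fails the rule
        count += 1
    return count

# Hypothetical eigenvalues for 5 variables, sorted largest first
eigenvalues = [2.1, 1.2, 0.9, 0.5, 0.3]
print([round(b, 3) for b in broken_stick(5)])
print(axes_passing_broken_stick(eigenvalues))
```

Because the first broken-stick value for five variables is about 2.28, a first eigenvalue of 2.1 fails the rule even though it is well above 1, which shows how conservative this criterion can be.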
PCA Example – Upwell
2nd Stopping Rule (Eigenvalue > Mean Randomization)
2 Axes meet this criterion
3rd Stopping Rule (p value)
2 Axes meet this criterion
➢ Performing a Randomization Test:
The randomization: shuffle values within variables (columns) and
recompute correlation matrix and eigenvalues. Repeat many times.
The test: Compare the actual eigenvalues (from test of real data)
against eigenvalues from the randomizations.
Calculate p value as:
$$p = \frac{1 + n}{1 + N}$$
where
n = number of randomizations where test statistic ≥ observed value
N = the total number of randomizations.
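The procedure above can be sketched in NumPy. This is an illustration, not PC-ORD's implementation: the function names, the number of randomizations, and the simulated data are assumptions. It shuffles each column independently, recomputes the eigenvalues of the correlation matrix, and applies p = (1 + n) / (1 + N) axis by axis.

```python
import numpy as np

def sorted_eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def randomization_test(data, n_randomizations=999, seed=0):
    rng = np.random.default_rng(seed)
    observed = sorted_eigenvalues(data)
    count_ge = np.zeros_like(observed)   # n: randomized >= observed, per axis
    total = np.zeros_like(observed)      # running sum for the mean eigenvalue
    shuffled = data.copy()
    for _ in range(n_randomizations):
        # Shuffle values within each variable (column) independently
        for j in range(data.shape[1]):
            shuffled[:, j] = rng.permutation(data[:, j])
        random_eigs = sorted_eigenvalues(shuffled)
        count_ge += random_eigs >= observed
        total += random_eigs
    p_values = (1 + count_ge) / (1 + n_randomizations)
    return observed, total / n_randomizations, p_values

# Hypothetical data: 100 samples, 5 variables, three sharing a common signal
rng = np.random.default_rng(42)
base = rng.normal(size=100)
data = rng.normal(size=(100, 5))
for j in range(3):
    data[:, j] = base + 0.3 * data[:, j]

observed, mean_random, p_values = randomization_test(data, n_randomizations=199)
print(np.round(observed, 3))
print(np.round(mean_random, 3))
print(np.round(p_values, 3))
```

The returned mean randomized eigenvalues correspond to the second stopping rule (eigenvalue > mean randomization), and the p values to the third.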
Principal Components – Randomizations
Rnd-Lambda – Compare the randomized eigenvalue for an axis to the observed eigenvalue for that axis
• fairly conservative and generally effective criterion
• more effective than Avg-Rnd when uncorrelated variables are included in the data
• performs better than other measures with strongly non-normal data
Rnd-F – Compare the randomized pseudo-F ratio for an axis to the observed pseudo-F ratio for that axis
Pseudo-F ratio: the eigenvalue for an axis divided by the sum of the remaining
(smaller) eigenvalues (sum of squares error – the remaining unexplained variance)
• particularly effective against uncorrelated variables
• performs poorly with grossly non-normal error structures
Avg-Rnd – Compare the observed eigenvalue for a given axis to the average
eigenvalue obtained for that axis after randomization
• good when the data do not contain uncorrelated variables
• less stringent; too liberal when the data contain uncorrelated variables
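As a small illustration of the pseudo-F ratio defined above, the sketch below divides each eigenvalue by the sum of the smaller eigenvalues that follow it. The eigenvalues are hypothetical, the function name is an assumption, and the last axis is given an infinite ratio because no eigenvalues remain after it.

```python
def pseudo_f_ratios(eigenvalues):
    """Pseudo-F per axis: eigenvalue / sum of the remaining (smaller) ones.

    Expects eigenvalues sorted from largest to smallest.
    """
    ratios = []
    for k, value in enumerate(eigenvalues):
        remaining = sum(eigenvalues[k + 1:])
        ratios.append(value / remaining if remaining > 0 else float("inf"))
    return ratios

# Hypothetical eigenvalues for 5 variables
ratios = pseudo_f_ratios([2.5, 1.5, 0.6, 0.3, 0.1])
print(ratios)
```

In an Rnd-F test, the same function would be applied to the eigenvalues from each randomization and compared with the observed ratios.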
Principal Components – Randomizations
PCA Example – Upwell39
PCA Example – Upwell36
PCA Example – Time
PCA Example – MEI
PCA Example – PDO
PCA Example – Upwell
[Figure: PCA_Upwell ordination, Axis 1 vs. Axis 2; variables labeled TIME, MEI, PDO, upwell36, upwell39]
PCA Example – Upwell
[Figure: PCA_Upwell ordination, Axis 1 vs. Axis 3; variables labeled TIME, MEI, PDO, upwell36, upwell39]
PCA Example – Upwell
NOTE: Use the Euclidean Distance
PCA Example – Upwell
Principal Components (PCA) – Paper I
➢ Example: Weichler et al. (2004).
➢ Objective: Relate seabird densities to seven
environmental parameters:
(1) water depth, (2) distance to nearest land, (3) number
of trawlers within a radius of 5 km, (4) sea surface
temperature, (5) water temperature difference (0 – 10 m) ,
(6) water temperature difference (0 – 30 m), and (7) water
temperature difference (10 – 50 m)
➢ NOTE: Did Not Report Cross-correlations
of environmental parameters
Principal Components (PCA) – Paper I
➢ Data Manipulations To Avoid Biases:
• Species densities (birds / km²) were selected as variables
and 10 min intervals (samples) were selected as cases
• Only species seen in at least five counting intervals were
included, an arbitrary choice that allowed covering a wide
spectrum of species while ignoring those with few occurrences
• Only commoner species with numbers exceeding 1% of all
individuals counted were included in the analysis
• Dataset of 46 sections of the cruise tracks. Each section
comprised a hydrographic station approximately midway and
10 min intervals in two opposite directions (4 – 8 km away)
• Sample Size: 46 samples / 7 variables: Ratio ≈ 6.6
Principal Components (PCA) – Paper I
➢ Community-Wide Result: Six principal components (eigenvalues > 1),
showing % of variation explained and ecological interpretation
➢ PC Axis Interpretations:
Principal Components (PCA) – Paper I
➢ Community-Wide Result: Loadings for the 11 seabird
species and 7 variables on the six principal components
• 3 principal components: 50 % of variance
• 6 principal components: 78 % of variance
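The "eigenvalue > 1" rule and the cumulative percentages quoted above can be sketched as below. The eigenvalues are invented for illustration and are not the values reported in the paper; the function names are hypothetical. The rule assumes a correlation-matrix PCA, where each original variable contributes a variance of 1.

```python
def axes_with_eigenvalue_above_one(eigenvalues):
    """Number of axes retained by the eigenvalue > 1 rule."""
    return sum(1 for value in eigenvalues if value > 1.0)

def cumulative_percent(eigenvalues):
    """Cumulative percent of total variance explained, axis by axis."""
    total = sum(eigenvalues)
    running, percents = 0.0, []
    for value in eigenvalues:
        running += value
        percents.append(100.0 * running / total)
    return percents

# Hypothetical eigenvalues for 4 variables, sorted largest first
eigenvalues = [3.0, 1.5, 0.9, 0.6]
n_axes = axes_with_eigenvalue_above_one(eigenvalues)
percents = cumulative_percent(eigenvalues)
print(n_axes, percents)
```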
Principal Components (PCA) – Paper I
➢ Community-Wide Result: Axes explained using (strongest)
loadings of different species and environmental variables
➢ Note: Cannot determine which loadings are significant
(what can we use to quantify correlation with axes?)
Principal Components (PCA) – Paper II
➢ Example: Ainley, D.G. et al. (2005).
➢ Objective: Relate densities of the 12 most abundant
species of seabirds to 12 habitat variables:
5 biological, 4 oceanographic, 3 geographic (spatial)
82.3%
Principal Components (PCA) – Paper II
➢ Oceanographic variables examined:
sea-surface temperature / salinity, thermocline depth / strength
[Figure panels: Date, Distance to Fronts, Chl Max, Acoustic Biomass]
Principal Components (PCA) – Paper II
➢ Data Manipulations To Avoid Biases:
• Densities log-transformed to meet normality assumptions
• Nevertheless, residuals generated in the regressions for
some species did not meet those assumptions (Skewness /
Kurtosis Test for Normality of Residuals, p < 0.05)
• Least-squares regression analysis (ANOVA), however,
is a very robust procedure with respect to non-normality
(Seber, 1977, Kleinbaum et al., 1988)
• Yet, while these analyses yield the best linear unbiased
estimator in the absence of normally distributed residuals,
p-values near 0.05 must be viewed with caution (Seber, 1977)
Principal Components (PCA) – Paper II
➢ To avoid double-absences:
• Only 15-min transects in which any given species was
recorded were analyzed
• The total sample size for the 12 species was 1209
➢ Is this an adequate sample size ?
Rule of thumb:
• 5 samples per variable (Tabachnick and Fidell 1989)
• 1209 / 12 ~ 100 samples per variable
Principal Components (PCA) – Paper II
➢ Analysis Methods:
• Principal components analysis (PCA), in combination
with Sidak multiple comparison tests, used to assess
differences in habitat selection among 12 seabird species
• To test for significant differences in habitat affinities
among seabird species, used two one-way ANOVAs:
In the first, tested for differences among PC1 scores of
each species; in the second, compared the PC2 scores
• Differences between two species significant if either
one or both PC scores differed significantly
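A rough sketch of the two ingredients named above: a one-way ANOVA F statistic on PC scores grouped by species, and a Sidak-adjusted significance level for multiple pairwise comparisons. It does not reproduce the paper's analysis. The scores are invented, the function names are hypothetical, and the F statistic is left without a p value (that would require the F distribution from a statistics library).

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA; `groups` is a list of score lists."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def sidak_alpha(alpha, n_comparisons):
    """Per-comparison alpha that keeps the family-wise error rate at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_comparisons)

# Hypothetical PC1 scores for three species
pc1_scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(pc1_scores)

# 12 species give 12 * 11 / 2 = 66 pairwise comparisons
adjusted = sidak_alpha(0.05, 66)
print(f_stat, adjusted)
```

The same F computation would be repeated on the PC2 scores for the second ANOVA.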
Principal Components (PCA) – Paper II
➢ Community-Wide Result: First and second PC axes
explained 60% of variance in distribution of 12 species
Principal Components (PCA) – Paper II
➢ Species-specific Results:
• Species mapped onto two
(independent) dimensions
• Pair-wise associations
(tested) denoted by circles
[Figure annotations: Near Fronts; Zoop Prey; Salty, Green; Fish Prey]
Principal Components (PCA) – Comparisons
➢ Number of Axes:
- Selected 2 – easy to interpret (Ainley et al. 2005)
- Selected 6 – based on eigenvalues > 1 (Weichler et al. 2004)
➢ Display of Results:
- Plot & table of eigenvalues (Ainley et al. 2005)
- Eigenvalues & interpretation (description) (Weichler et al. 2004)
➢ Significance Tests:
- Pairwise species comparisons (ANOVA) (Ainley et al. 2005)
- Correlations with selected variables (Weichler et al. 2004)
Principal Components (PCA) – References
Ainley DG, Spear LB, Tynan CT, Barth JA, Pierce SD, Ford RG, Cowles TJ
(2005). Physical and biological variables affecting seabird distributions
during the upwelling season of the northern California Current. Deep-Sea
Research II 52: 123–143
Weichler T, Garthe S, Luna-Jorquera G, Moraga J (2004). Seabird
distribution on the Humboldt Current in northern Chile in relation to
hydrography, productivity, and fisheries. ICES J. Marine Science
61 (1):148-154
Disclaimer References
Seber, G.A.F. (Ed.), 1977. Linear Regression Analysis. Wiley, New York.
Kleinbaum, D.G., Kupper, L.L., Muller, K.E., 1988. Applied Regression
Analysis and other Multivariable Methods. PWS-KENT Publishing Company,
Boston.
Tabachnick, B.G. and L.S. Fidell. 1989. Using Multivariate Statistics. 2nd ed.
New York: Harper and Row.