
Page 1: Multivariate Methods

Multivariate Methods

Nels Johnson and Matt Williams
Laboratory for Interdisciplinary Statistical Analysis

Page 2: Multivariate Methods

Outline

• Principal Component Analysis
• Factor Analysis
• Multivariate T Tests
• MANOVA
• Multidimensional Scaling
• Correspondence Analysis

Page 3: Multivariate Methods

PCA – Motivating Examples

• You have measured a number of variables concerning the size of aphids. You’d like to reduce the number of variables used for classification.

• You have a bunch of football statistics for teams and would like to organize related teams based on these statistics.

Page 4: Multivariate Methods

What is it?

• Based on an eigenvalue decomposition of the covariance matrix S (or correlation matrix R) of the variables.
• Goal: maximize the variance of linear combinations of the variables.
• Obtained by transforming the variables so that the covariance of the new variables is diagonal.
• These new variables are called the principal components (PCs), and their covariance matrix contains the eigenvalues along the diagonal.
• This transformation can be thought of as a rotation of the axes.
• Note: no variables are designated as dependent.
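
To make the decomposition concrete, here is a minimal NumPy sketch (the data and variable names are illustrative, not from the slides): standardize the variables, eigendecompose the correlation matrix R, and rotate the data onto the PCs, whose sample covariance comes out diagonal with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # illustrative data: 100 observations, 5 variables

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # standardize, so we work with the correlation matrix R
R = np.cov(Z, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)                # eigendecomposition (eigh, since R is symmetric)
order = np.argsort(eigvals)[::-1]                   # sort PCs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                                # PC scores: the rotated variables
# Covariance of the scores is (numerically) diagonal, with the eigenvalues on the diagonal
print(np.round(np.cov(scores, rowvar=False), 6))
print(np.round(eigvals, 6))
```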

Page 5: Multivariate Methods

What do we get out of it?

• We can form an index measure (i.e. a score) or a weighted average of variables based on a subset of the PCs.

• This reduces the number of variables we have to work with.

• With some subject matter area knowledge we might be able to interpret the meaning of some of the PCs based on correlations.

Page 6: Multivariate Methods

How to reduce the number of PCs?

• Pick a proportion of variation you want to explain ahead of time; then keep the smallest number of PCs whose eigenvalues sum to at least that proportion of the total variance.

• Scree plots
• Keep all PCs with eigenvalue > 1 (Kaiser's rule, for the correlation matrix)
• Broken stick method
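
A short sketch of the first two rules, assuming the eigenvalues have already been computed and sorted as in the previous sketch (the 80% threshold and the example eigenvalues are illustrative choices):

```python
import numpy as np

# Eigenvalues of R (or S), sorted in decreasing order -- e.g. eigvals from the sketch above
eigvals = np.array([2.4, 1.3, 0.7, 0.4, 0.2])        # illustrative values

prop = eigvals / eigvals.sum()                       # proportion of variation explained by each PC
cum = np.cumsum(prop)
k_prop = int(np.argmax(cum >= 0.80) + 1)             # smallest number of PCs explaining at least 80%
k_kaiser = int(np.sum(eigvals > 1))                  # Kaiser's rule: eigenvalue > 1 (correlation matrix)
print(k_prop, k_kaiser)
```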

Page 7: Multivariate Methods

What are some issues?

• The scale the variables are measured on matters.
– Standardize the variables so they are all on the same scale.

• Variables with a high amount of variability (i.e. large variance) will naturally steer the decomposition.
– Again, standardize the variables.

• When separation occurs perpendicular to an axis (i.e. a PC), it might not be picked up without looking at other axes.
– Plot the pairwise scores for each PC. This may require looking at too many graphs to be feasible.

Page 8: Multivariate Methods

Scree Plot

Page 9: Multivariate Methods

Biplot of Scores

Page 10: Multivariate Methods

Factor Analysis – Some Motivating Examples

• You have the ratings people give to their family members in areas such as kindness, intelligence, and happiness. You want to associate family members with some overall construct underlying these words.

• You have conducted a survey and want to group questions based on the topic they address.

Page 11: Multivariate Methods

What is it?

• We assume the variables Y can be summarized by some underlying, unobserved, and reduced set of variables called factors (you must pick how many factors).
• The goal is to estimate the factors.
• After the factors are estimated, the next goal is to orthogonally rotate the solution to get simpler factors.
• For the Principal Factor Solution (more later):
– Model: Y − μ = loadings·factors + error
– var(Y − μ) or corr(Y − μ) = V = loadings·loadingsᵀ + Ψ
– The diagonals of H = V − Ψ are called the communalities. They are R²-like numbers.
– Ψ is called the specific variance.
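
As one way to see these quantities in practice, scikit-learn's FactorAnalysis (which fits the model by an iterative maximum-likelihood-type method rather than the principal factor solution named above, and whose rotation argument needs a reasonably recent scikit-learn) returns loadings and specific variances from which the communalities can be computed. The data and the choice of 2 factors below are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
# Illustrative data: 6 observed variables driven by 2 latent factors plus noise
factors = rng.normal(size=(200, 2))
true_loadings = rng.normal(size=(2, 6))
Y = factors @ true_loadings + 0.5 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax")   # you must pick the number of factors
fa.fit(Y)

loadings = fa.components_.T                    # (variables x factors) loading matrix
psi = fa.noise_variance_                       # specific variances (the diagonal of Psi)
communalities = (loadings ** 2).sum(axis=1)    # R^2-like: variance each variable shares with the factors
print(np.round(communalities, 3))
print(np.round(psi, 3))
```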

Page 12: Multivariate Methods

How to Estimate the Factors?

• Three main ways:
– Principal Component Solution (not PCA!)
• Focuses on the diagonal of V (the variances).
• Does poorly on the off-diagonals (the covariances).
– Principal Factor Solution
• Focuses on the off-diagonals of V and largely ignores the diagonal.
– Maximum Likelihood Method
• Assumes normality of the errors and estimates the factors and loadings using an iterative MLE method.
• May give nonsensical answers (i.e. Heywood cases).
• The iterative method can be adjusted so this doesn't happen.
• Rotations are unique.

Page 13: Multivariate Methods

More On Rotations

• If the rotation is orthogonal, then
loadings·loadingsᵀ = loadings·rotation·rotationᵀ·loadingsᵀ = (loadings·rotation)·(loadings·rotation)ᵀ

• So we can redistribute the total variance, and the variation explained by each variable, differently among the factors without actually changing them.

• Lots of methods to pick rotations.
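
A quick NumPy check of the identity above, with an illustrative loading matrix and a random orthogonal rotation: loadings·loadingsᵀ, and hence V, is unchanged by the rotation.

```python
import numpy as np

rng = np.random.default_rng(2)
loadings = rng.normal(size=(6, 3))                    # illustrative 6-variable, 3-factor loading matrix
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # a random orthogonal matrix via QR

rotated = loadings @ rotation
# loadings @ loadings.T is unchanged by the orthogonal rotation, so V = LL' + Psi is unchanged too
print(np.allclose(loadings @ loadings.T, rotated @ rotated.T))   # True
```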

Page 14: Multivariate Methods

Interpreting the Analysis

• Loadings represent the covariance (or correlation) between factors and variables.

• So we look for high loadings to represent how underlying factors influence variables.

• With some subject matter knowledge we can name factors based on these loadings (when they make sense).

Page 15: Multivariate Methods

Some Issues

• Results can change depending on model choices (this is a big deal)!
– Number of factors
– Estimation method
– Rotation method

• Heywood cases when using MLE.
• The existence of actual factors is suspect.

Page 16: Multivariate Methods

Example

[Schematic table of factor loadings for six variables (y1–y6) on three factors (f1, f2, f3), before and after rotation; an "x" marks a high loading. After rotation, each variable's high loadings are concentrated on fewer factors, giving a simpler structure.]

Page 17: Multivariate Methods

Multivariate T Tests

• Univariate t-test
– Normal data, with unknown mean and variance

• Hotelling's T² Test
– Multivariate normal data with unknown mean and covariance

$t = \dfrac{\bar{X} - \mu_0}{\sqrt{s^2/n}}, \qquad t^2 \sim F_{1,\nu}$

$T^2 = (\bar{X} - \mu_0)^T (S/n)^{-1} (\bar{X} - \mu_0), \qquad \dfrac{\nu - p + 1}{\nu p}\, T^2_{p,\nu} \sim F_{p,\,\nu - p + 1}$

Page 18: Multivariate Methods

One Sample Test

• Assumptions
– Observations are independent and multivariate normal

• Testing
– Null hypothesis: μ = μ₀ (vectors)
– Alternative: μ ≠ μ₀ (vectors)

$T^2 = (\bar{X} - \mu_0)^T (S/n)^{-1} (\bar{X} - \mu_0)$

$\dfrac{n - p}{(n - 1)p}\, T^2_{p,\,n-1} \sim F_{p,\,n-p}$

Page 19: Multivariate Methods

Example: One Sample Test

• We are interested in 3 different types of calcium in the soil

• We wish to test whether the true mean vector is (15, 6, 2.85)

$\bar{Y} = (28.1,\ 7.18,\ 3.09)$

$S = \begin{bmatrix} 140.54 & 49.68 & 1.94 \\ 49.68 & 72.25 & 3.68 \\ 1.94 & 3.68 & 0.25 \end{bmatrix}$

$T^2 = 24.559 > T^2_{.05,\,3,\,9} = 16.766$, so we reject the null hypothesis.
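
A sketch that reproduces this calculation from the slide's summary statistics. The sample size n = 10 is an assumption inferred from the T²_{3,9} critical value, and rounding of the displayed values may move the result slightly from 24.559:

```python
import numpy as np
from scipy import stats

n, p = 10, 3                        # sample size assumed from the T2_{3,9} critical value
ybar = np.array([28.1, 7.18, 3.09])
mu0 = np.array([15.0, 6.0, 2.85])
S = np.array([[140.54, 49.68, 1.94],
              [ 49.68, 72.25, 3.68],
              [  1.94,  3.68, 0.25]])

d = ybar - mu0
T2 = n * d @ np.linalg.solve(S, d)            # T^2 = n (ybar - mu0)' S^{-1} (ybar - mu0)
F = (n - p) / ((n - 1) * p) * T2              # convert to an F statistic
pval = stats.f.sf(F, p, n - p)
print(round(T2, 3), round(pval, 4))
```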

Page 20: Multivariate Methods

Two Sample Test

• Assumptions
– Two groups of multivariate normal data
• Observations are independent
• Means may be different, but the covariance is the same for both groups

• Testing
– Null hypothesis: μ₁ = μ₂ (vectors)
– Alternative: μ₁ ≠ μ₂ (vectors)

$T^2 = (\bar{X}_1 - \bar{X}_2)^T \left[ S_p \left( \tfrac{1}{n_1} + \tfrac{1}{n_2} \right) \right]^{-1} (\bar{X}_1 - \bar{X}_2)$

$\dfrac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p}\, T^2_{p,\,n_1+n_2-2} \sim F_{p,\,n_1+n_2-p-1}$

Page 21: Multivariate Methods

Example: Two Sample Test

• Four psychological tests were given to 32 men and 32 women

• We are interested in seeing if the mean vectors are the same

$\bar{Y}_1 = (15.97,\ 15.91,\ 27.19,\ 22.75)$

$\bar{Y}_2 = (12.34,\ 13.91,\ 16.66,\ 21.94)$

$S_p = \begin{bmatrix} 7.164 & 6.047 & 5.693 & 4.701 \\ 6.047 & 15.89 & 8.492 & 5.856 \\ 5.693 & 8.492 & 29.36 & 13.98 \\ 4.701 & 5.856 & 13.98 & 22.32 \end{bmatrix}$

$T^2 = 97.602 > T^2_{.01,\,4,\,62} = 15.373$, so we reject the null hypothesis.
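
The same kind of sketch for the two-sample statistic, using the slide's summary statistics (n₁ = n₂ = 32); again, rounding of the displayed values may shift the result slightly from 97.602:

```python
import numpy as np
from scipy import stats

n1 = n2 = 32
p = 4
y1bar = np.array([15.97, 15.91, 27.19, 22.75])
y2bar = np.array([12.34, 13.91, 16.66, 21.94])
Sp = np.array([[7.164,  6.047,  5.693,  4.701],
               [6.047, 15.89,   8.492,  5.856],
               [5.693,  8.492, 29.36,  13.98 ],
               [4.701,  5.856, 13.98,  22.32 ]])

d = y1bar - y2bar
T2 = d @ np.linalg.solve(Sp * (1 / n1 + 1 / n2), d)   # T^2 = d' [Sp(1/n1 + 1/n2)]^{-1} d
nu = n1 + n2 - 2
F = (nu - p + 1) / (nu * p) * T2                      # ~ F_{p, nu - p + 1} under the null
print(round(T2, 3), round(stats.f.sf(F, p, nu - p + 1), 4))
```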

Page 22: Multivariate Methods

Other Tests

• Two-sample paired test
– Use the difference vector D = X₁ − X₂

• Partial tests
– Testing μᵢ = μᵢ₀ in the presence of the other (p − 1) means

• What about more than 2 groups?
– We had ANOVA instead of a t-test
– Now we have MANOVA instead of a T²

Page 23: Multivariate Methods

Multivariate Analysis of Variance (MANOVA)

• Suppose we have data organized into several groups, with each observation giving a vector of responses

• We would like to test the hypothesis that all the means for each of the groups are equal

• We can do this in a manner very similar to the univariate Analysis of Variance (ANOVA)

Page 24: Multivariate Methods

MANOVA

• In ANOVA
– We compare Sums of Squares within groups to Sums of Squares between groups
– Sums of Squares are the sums of the squared differences between the observed values and the means

• In MANOVA
– We compare Sums of Squares matrices from within the groups to those between the groups
– E is the "within" Sums of Squares matrix
– H is the "between" Sums of Squares matrix

Page 25: Multivariate Methods

Four Tests

• There are four tests based on the eigenvalues of $E^{-1}H$: $\lambda_1 > \lambda_2 > \dots > \lambda_s$, with $s \le \min(p, d)$

• Pillai: $V^{(s)} = \sum_{i=1}^{s} \dfrac{\lambda_i}{1 + \lambda_i}$

• Lawley–Hotelling: $U^{(s)} = \sum_{i=1}^{s} \lambda_i$

• Wilks' Lambda: $\Lambda = \prod_{i=1}^{s} \dfrac{1}{1 + \lambda_i}$ (reject for small values)

• Roy's Largest Root: $\theta = \dfrac{\lambda_1}{1 + \lambda_1}$
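
Given the E and H matrices, all four statistics are simple functions of the eigenvalues of E⁻¹H. A minimal sketch with illustrative placeholder matrices (not the rootstock data):

```python
import numpy as np

# Illustrative within (E) and between (H) sums-of-squares matrices
E = np.array([[10.0, 2.0],
              [ 2.0, 8.0]])
H = np.array([[ 6.0, 1.0],
              [ 1.0, 4.0]])

lam = np.linalg.eigvals(np.linalg.solve(E, H)).real   # eigenvalues of E^{-1}H
lam = np.sort(lam)[::-1]

pillai = np.sum(lam / (1 + lam))      # V^(s)
lawley = np.sum(lam)                  # U^(s)
wilks = np.prod(1 / (1 + lam))        # Lambda (reject for small values)
roy = lam[0] / (1 + lam[0])           # theta, Roy's largest root
print(pillai, lawley, wilks, roy)
```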

Page 26: Multivariate Methods

Comparison of the Four Tests

• In the collinear case
– The groups have means that lie (approximately) on a line in space
– θ ≥ U(s) ≥ Λ ≥ V(s) in terms of power

• In the diffuse case
– The group means are spread out in a higher-dimensional space (not a line)
– θ ≤ U(s) ≤ Λ ≤ V(s) in terms of power

Page 27: Multivariate Methods

Post-Test Analysis

• Just like with ANOVA, after the test we can
– Do pair-wise comparisons or contrasts

• In MANOVA we can also
– Do tests for the p individual variables
– Do F tests to identify which variables are different

Page 28: Multivariate Methods

Example: Rootstock Data

• We wish to compare apple trees of different rootstocks
• We have 8 trees from each of 6 rootstocks
• Our four measurements are
– Trunk girth at 4 years (y1)
– Extension growth at 4 years (y2)
– Trunk girth at 15 years (y3)
– Extension growth at 15 years (y4)

Page 29: Multivariate Methods

Rootstock Data

• Test Results
– Λ = .154 < Λ_{.05,4,5,40} = .455
– V(s) = 1.305 > V(s)_{.05} = .645
– U(s) = 2.921 > U(s)_{.05}
– θ = .652 > θ_{.05} = .377

• Follow-up tests for individual variables
– Y1: F = 1.93, p = .1094
– Y2: F = 2.91, p = .024
– Y3: F = 11.97, p < .0001
– Y4: F = 12.16, p < .0001

Page 30: Multivariate Methods

Extensions

• Two-way MANOVA
• Multivariate Contrasts
• Mixed Models
• Split plot designs
• Profile Analysis
• Different R²-like numbers

Page 31: Multivariate Methods

Multidimensional Scaling (MDS)

• Data is a distance or similarity matrix
– Many ways to generate one

• Goal is to reduce dimension and visualize
– Often look at only 2 or 3 dimensions

• Motivating examples
– Number of teeth for different species of mammals
– Discriminating between colors (red vs. orange)
– Distances between cities

Page 32: Multivariate Methods

Two Kinds of MDS

• Metric scaling (principal coordinates analysis)
– Distances (Euclidean) in the reduced dimension are close to those measured in the full dimension

• Non-metric scaling
– The rank order of distances in the reduced dimension is close to that measured in the full dimension

Page 33: Multivariate Methods

Types of Measures

• There are MANY measures that can be used
– Depends on the type of data
– Depends on interest in observations vs. variables

• Properties
1. Minimum of 0: D(x, y) = 0 if x = y
2. Positive otherwise: D(x, y) > 0
3. Symmetric: D(x, y) = D(y, x)
4. Triangle inequality: D(x, y) + D(y, z) ≥ D(x, z)

Page 34: Multivariate Methods

Types of Measures

• Measures that satisfy 1–4 are called metrics
• Measures satisfying 1–3 are semi-metrics
• Some measures have negative values and are called non-metrics
• Certain measures can be plotted or visualized in a Euclidean space
– Distances and relationships plotted are meaningful
– This is a stronger property than the triangle inequality

Page 35: Multivariate Methods

Measures for our Examples

• Mammal teeth – counts of teeth types
– Manhattan (city block) distance
– Total number of teeth that differ between two species

• Difference between colors (Ekman)
– Similarity measure, converted to a distance
– How well people distinguish between colors
– We use the Kruskal measure (non-metric)

• Distances between cities
– Euclidean distance
– Miles between cities

Page 36: Multivariate Methods

Basic Procedure for MDS

• Metric scaling
– Eigenvalue/eigenvector decomposition
– Choose a reduced number of components that still preserves distances
– Create new coordinates based on the reduced components

• Non-metric scaling
– Reduce dimensions but preserve rank order
– Done using isotonic regression and iterative algorithms
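
A minimal sketch of the metric-scaling (principal coordinates) steps with NumPy, using an illustrative Euclidean distance matrix; for non-metric scaling one would typically use an iterative implementation such as sklearn.manifold.MDS with metric=False rather than code it by hand:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 6))                 # illustrative data: 10 objects, 6 variables
D = squareform(pdist(X))                     # Euclidean distance matrix (could also be 'cityblock', etc.)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                        # keep 2 dimensions for plotting
coords = eigvecs[:, :k] * np.sqrt(eigvals[:k])   # principal coordinates
print(coords.shape)
```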

Page 37: Multivariate Methods

Examples: Teeth Data

• 32 mammals and 8 categories of teeth
• We are interested in how “close” these mammals are based on their teeth counts
• We use city block distance and want to reduce things to 2 dimensions (from 8)

Page 38: Multivariate Methods

Teeth Data

Page 39: Multivariate Methods

Example: Ekman Color Study

• 14 different wavelengths
• 31 subjects asked to rate how well they could distinguish between different pairs
• Ratings were averaged and scaled to get a similarity index between 0 and 1
• We use non-metric scaling and look at a reduction to 2 dimensions (from 14)

Page 40: Multivariate Methods

Color Study

Page 41: Multivariate Methods

Example: Distances between cities

• We have 10 U.S. cities and distances between all pairs

• Can we reduce this distance matrix to a lower dimension, like 2 (from 10)?

Page 42: Multivariate Methods

City Distances

Page 43: Multivariate Methods

Comments on MDS

• There are MANY measures we can use
– Some make more sense than others
– It depends on the data and what you are interested in
– Different measures can lead to different results

• How many dimensions should you use?
– It's easiest to explain 2–3 dimensions
– There are different criteria or guidelines for metric and non-metric scaling

Page 44: Multivariate Methods

One More Example

• Suppose we have data that can be organized into a two-way table of binary or count values.

• For a small table we can do some contingency table analyses like tests for homogeneity or independence.

• For large tables we might like to reduce or summarize the table

• One method is called Correspondence Analysis

Page 45: Multivariate Methods

Correspondence Analysis

• Our distance measure is the Pearson chi-square measure between the observed cell value and its expected value.

• As before, we need to decide if we are interested in our subjects or our variables

• Similar or analogous to PCA and MDS in terms of dimension reduction and interpretation.

• Unfortunately, the terminology is a little different. So be careful.
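
As a rough illustration of the dimension-reduction step, one common formulation forms the matrix of standardized (Pearson) residuals from the two-way table and takes its SVD; the leading singular vectors give row and column coordinates. The count table below is an illustrative placeholder, not the postal data:

```python
import numpy as np

# Illustrative two-way count table (rows = groups, columns = outcome categories)
N = np.array([[30.0,  5.0,  2.0],
              [25.0,  8.0,  4.0],
              [20.0, 12.0,  6.0]])

P = N / N.sum()                        # correspondence matrix
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized (chi-square) residuals

U, sing, Vt = np.linalg.svd(S)
row_coords = (U[:, :2] * sing[:2]) / np.sqrt(r)[:, None]   # principal coordinates for rows
col_coords = (Vt.T[:, :2] * sing[:2]) / np.sqrt(c)[:, None]
print(np.round(row_coords, 3))
print(np.round(col_coords, 3))
```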

Page 46: Multivariate Methods

Example: Postal Employees

• Postal employees for 6 positions were drug tested
• Results include negative, marijuana, cocaine, and other
• We are interested in identifying any patterns or trends

Page 47: Multivariate Methods

Postal Employees

Page 48: Multivariate Methods

Sources

• The information in this talk was compiled from Methods of Multivariate Analysis, 2nd ed., by Alvin C. Rencher, and from our notes from STAT 5504, compiled by Dr. Eric Smith, Dept. of Statistics.

• Thanks! Any questions?