multivariate analysis intro

28
Lecture #1 - 7/18/2011 Slide 1 of 28 Introduction to Multivariate Analysis Lecture 1 July 18, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Upload: roimarketing

Post on 06-May-2017

257 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 1 of 28

Introduction to Multivariate Analysis

Lecture 1

July 18, 2011Advanced Multivariate Statistical Methods

ICPSR Summer Session #2

Page 2: Multivariate Analysis Intro

Overview

● Today’s Lecture

Course Overview

Data Organization

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 2 of 28

Today’s Lecture

■ Introductions

■ Syllabus and course overview

■ Chapter 1 (a brief review, really):

◆ Data organization/notation

◆ Graphical techniques

◆ Distance measures

■ Introduction to SAS

Page 3: Multivariate Analysis Intro

Overview

Course Overview

● Multivariate

● Course Structure

● Multivariate

Data Organization

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 3 of 28

Multivariate Statistics and Thinking

■ Although titled “Advanced Multivariate Statistical Methods”this course is an overview of thinking about data andmethods from a multivariate lens:

◆ Many methods fall under the label “multivariate statistics”(e.g., Multivariate ANOVA, Discriminant Analysis, PrincipalComponent Analysis)

◆ Many multivariate statistical distributions exist (e.g.,Multivariate Normal, Wishart)

◆ Many modern (univariate) statistical methods rely onthese multivariate distributions, especially the multivariatenormal distribution

■ This course will focus on multivariate thinking, not just aboutmethods, but also about the foundations of multivariatestatistical analysis

Page 4: Multivariate Analysis Intro

Overview

Course Overview

● Multivariate

● Course Structure

● Multivariate

Data Organization

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 4 of 28

Course Structure

■ The course is organized around a central topic each week:

1. Foundations of Multivariate Thinking and TheMultivariate Normal Distribution◆ Matrix algebra◆ Multivariate normal distribution

2. Multivariate Normal and Linear Mixed Models◆ Multivariate ANOVA◆ Discrimination/classification◆ Linear models

3. Multivariate Data Reduction Procedures◆ Principal components analysis◆ Factor analysis and structural equation modeling

4. Generalized Multivariate Techniques◆ Distance methods◆ Finite mixture models◆ Categorical distributions

Page 5: Multivariate Analysis Intro

Overview

Course Overview

● Multivariate

● Course Structure

● Multivariate

Data Organization

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 5 of 28

Multivariate Statistics

A taxonomy of multivariate statistical analyses shows that mosttechniques fall into one of the following categories:

1. Data reduction or structural simplification

2. Sorting and grouping

3. Investigation of the dependence among variables

4. Prediction

5. Hypothesis construction and testing

Page 6: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

● Arrays

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 6 of 28

Data Organization

■ As a precursor of things to come, here is a preview of theways data are organized in this book/course

■ Multivariate data are a collection of observations (ormeasurements) of:

◆ p variables (k = 1, . . . , p)

◆ n “items” (j = 1, . . . , n)

■ “items” can also be though of assubjects/examinees/individuals or entities (when peopleare not under study)

■ In some disciplines (such as educational measurement),“items” are considered the variables collected perindividual

Page 7: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 7 of 28

Data Organization

■ xjk = measurement of the kth variable on the jth entity

Variable 1 Variable 2 . . . Variable k . . . Variable p

Item 1: x11 x12 . . . x1k . . . x1p

Item 2: x21 x22 . . . x2k . . . x2p

......

......

...Item j: xj1 xj2 . . . xjk . . . xjp

......

......

...Item n: xn1 xn2 . . . xnk . . . xnp

Page 8: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

● Arrays

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 8 of 28

Arrays

■ To represent the entire collection of items and entities, arectangular array can be constructed:

X =

x11 x12 . . . x1k . . . x1p

x21 x22 . . . x2k . . . x2p

......

......

xj1 xj2 . . . xjk . . . xjp

......

......

xn1 xn2 . . . xnk . . . xnp

■ In the next class, we will learn about how arrays like thishave an algebra that makes life somewhat easier

■ All arrays will be symbolized by boldfaced font

Page 9: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

● Arrays

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 9 of 28

Array Example

■ So, putting things all together, envision standing outside ofthe bookstore, asking people for receipts

■ You are interested in looking at two variables:

◆ Variable 1: the total amount of the purchase

◆ Variable 2: the number of books purchased

■ You find four people, and here is what you see observe (withnotation:

x11 = 42 x21 = 52 x31 = 48 x41 = 58

x12 = 4 x22 = 5 x32 = 4 x42 = 3

Page 10: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

● Arrays

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 10 of 28

Array Example (Continued)

■ The data array would the look like:

X =

x11 x12

x21 x22

x31 x32

x41 x42

=

42 4

52 5

48 4

58 3

■ Notice for any variable, xjk:

◆ The first subscript (j) represents the ROW location in thedata array

◆ The second subscript (k) represents the COLUMNlocation in the data array

Page 11: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

● Sample Mean

● Sample Variance

● Sample Correlation

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 11 of 28

Descriptive Statistics Review

■ When we have a large amount of data, it is often hard to geta manageable description of the nature of the variablesunder study

■ For this reason (and as a way of introducing a review topicsfrom previous courses), descriptive statistics are used

■ Such descriptive statistics include:

◆ Means

◆ Variances

◆ Covariances

◆ Correlations

Page 12: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

● Sample Mean

● Sample Variance

● Sample Correlation

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 12 of 28

Sample Mean

■ For the kth variable, the sample mean is:

x̄k =1

n

n∑

j=1

xjk

■ An array of the means for all p variables then looks like this(which we will come to know as the mean vector):

x̄ =

x̄1

x̄2

x̄3

x̄4

Page 13: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

● Sample Mean

● Sample Variance

● Sample Correlation

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 13 of 28

Sample Variance

■ For the kth variable, the sample variance is:

s2

k = skk =1

n

n∑

j=1

(xjk − x̄k)2

■ Note the “kk” subscript, this will be important because theequation that produces the variance for a single variable is aderivation of the equation of the covariance for a pair ofvariables

■ Also note the division by n

◆ Reasons for this will become apparent in the near future(hint: it’s a type of estimate)

■ For a pair of variables, i and k, the sample covariance is:

sik =1

n

n∑

j=1

(xji − x̄i)(xjk − x̄k)

Page 14: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

● Sample Mean

● Sample Variance

● Sample Correlation

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 14 of 28

Sample Covariance Matrix

■ Making an array of all sample covariances give us:

Sn =

s11 s12 . . . s1p

s21 s22 . . . s2p

......

. . ....

sp1 sp2 . . . spp

Page 15: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

● Sample Mean

● Sample Variance

● Sample Correlation

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 15 of 28

Sample Correlation

■ Sample covariances are dependent upon the scale of thevariables under study

■ For this reason, the correlation is often used to describe theassociation between two variables

■ For a pair of variables, i and k, the sample correlation isfound by dividing the sample covariance by the product ofthe standard deviation of the variables:

rik =sik√

sii

√skk

■ The sample correlation:◆ Ranges from -1 to 1◆ Measures linear association◆ Is invariant under linear transformations of i and k◆ Is a biased statistic

Page 16: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

● Sample Mean

● Sample Variance

● Sample Correlation

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 16 of 28

Sample Correlation Matrix

■ Making an array of all sample correlations give us:

R =

1 r12 . . . r1p

r21 1 . . . r2p

......

. . ....

rp1 rp2 . . . 1

Page 17: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

Graphical Techniques

● Bivariate Scatterplots

● Trivariate Scatterplots

● Stars

● Chernoff Faces

● Dendrograms

● Variable Space

● Network Diagrams

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 17 of 28

Graphical Techniques

■ Displaying multivariate data can be difficult due to our naturallimitations of seeing the world in three dimensions

■ Several simple ways of displaying data include:

◆ Bivariate scatterplots

◆ Three-dimensional scatterplots

Page 18: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 18 of 28

Bivariate Scatterplots

Page 19: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 19 of 28

Trivariate Scatterplots

Page 20: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

Graphical Techniques

● Bivariate Scatterplots

● Trivariate Scatterplots

● Stars

● Chernoff Faces

● Dendrograms

● Variable Space

● Network Diagrams

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 20 of 28

Graphical Techniques

■ But you likely already have seen those plots

■ Some plots that can be achieved by multivariate methodsinclude:

◆ “Stars”

◆ Chernoff faces

◆ Dendrograms

◆ Bivariate plots, but of the variable space

◆ Network graphs

Page 21: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 21 of 28

Stars

Page 22: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 22 of 28

Chernoff Faces

Page 23: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 23 of 28

Dendrograms

Page 24: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 24 of 28

Variable Space Plots

Page 25: Multivariate Analysis Intro

Lecture #1 - 7/18/2011 Slide 25 of 28

Network Diagrams

P000000

P000001

P000010

P000011

P000100

P000101

P000110

P000111

P001000

P001001

P001010

P001011

P001100

P001101

P001110

P001111

P010000

P010001

P010010

P010011

P010100

P010101

P010110

P010111

P011000

P011001

P011010

P011011

P011100

P011101

P011110

P011111

P100000

P100001

P100010

P100011

P100100

P100101

P100110

P100111

P101000

P101001

P101010

P101011

P101100P101101

P101110

P101111

P110000

P110001

P110010

P110011

P110100

P110101

P110110

P110111

P111000

P111001

P111010

P111011

P111100

P111101

P111110

P111111

Pajek

Page 26: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 26 of 28

Distance Measures

■ A great number of multivariate techniques revolve around thecomputation of distances:

◆ Distances between variables

◆ Distances between entities (people, objects, etc.)

■ The formula for the Euclidean distance formula between thecoordinate pair P = (x1, x2) and the origin O = (0, 0):

d(O, P ) =√

(x1 − 0)2 + (x2 − 0)2

Page 27: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

Lecture #1 - 7/18/2011 Slide 27 of 28

Distance Measures

■ Elaborate discussions of distance measures will be foundlater in the class

■ There are also some statistical analogs to distancemeasures, taking the variability of variables into account

■ Also be aware that there are literally an infinite number ofdistance measures!

■ To be considered an actual “distance”, a distance measuremust satisfy the following:

◆ d(P, Q) = d(Q, P )

◆ d(P, Q) > 0 if P 6= Q

◆ d(P, Q) = 0 if P = Q

◆ d(P, Q) ≤ d(P, R) + d(R, Q) (known as the triangleinequality)

Page 28: Multivariate Analysis Intro

Overview

Course Overview

Data Organization

Descriptive Statistics

Graphical Techniques

Distance Measures

Wrapping Up

● Final Thoughts

Lecture #1 - 7/18/2011 Slide 28 of 28

Final Thoughts

■ We introduced what this course will be about - the wild worldof multivariate statistics

■ Things will become increasingly relevant as timeprogresses...but do not hesitate to ask “why?”

■ We will now head down to the lab for a SAS introductionsession

■ Tomorrow’s Class: Matrix algebra (Chapter 2 andSupplement 2A)