
Advanced Applied Multivariate Analysis

STAT 2221, Spring 2015

Sungkyu Jung

Department of Statistics

University of Pittsburgh

E-mail: [email protected]

http://www.stat.pitt.edu/sungkyu/


General Information

• Course Webpage: http://www.stat.pitt.edu/sungkyu/course/2221Spring15/

• Prerequisite: None (officially), but
    • probability and inference theory
    • linear algebra
    • R, SAS or Matlab programming.


What is multivariate analysis?

1 First course in statistics: numbers – random variables

2 Second course in statistics: vectors of numbers – random vectors

    • Basis for the analysis of more complex objects, e.g. functions, matrices, tensors, images, networks.

3 Data Exploration: visualization of relationships between observations.

4 Discovering and modeling patterns in a dataset: Visualization, Clustering, Multivariate distributions.

5 Confirming patterns: Inference.

6 Dimension Reduction: PCA, CCA, SVD.

7 Predictions: Regression, Classification.


What is a multivariate dataset?

Multivariate statistical analysis concerns multivariate data, where each observation consists of many measurements on the same subject. We suppose the dataset $X = \{X_1, \ldots, X_n\}$ has $n$ observations ($n$ is called the sample size), and each observation $X_i = (x_{i1}, \ldots, x_{ip})$ is a vector in $\mathbb{R}^p$ ($p$ is called the dimension). These are often recorded in a $p \times n$ matrix whose $i$-th column is the observation $X_i$:

\[
X = \begin{pmatrix} X_1 & \cdots & X_n \end{pmatrix}
= \begin{pmatrix}
x_{11} & \cdots & x_{n1} \\
\vdots & \ddots & \vdots \\
x_{1p} & \cdots & x_{np}
\end{pmatrix}
\]
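As a minimal sketch of this convention, the following Python snippet stacks a few observations as the columns of a p × n array. The use of Python with NumPy is an assumption of this example (the course itself names R, SAS or Matlab), and the numbers are made up.

```python
import numpy as np

# Three made-up observations (n = 3), each with p = 2 measurements.
observations = [
    np.array([1.0, 2.0]),   # X_1 = (x_11, x_12)
    np.array([3.0, 4.0]),   # X_2 = (x_21, x_22)
    np.array([5.0, 6.0]),   # X_3 = (x_31, x_32)
]

# Stack observations as columns: rows index variables, columns index subjects.
X = np.column_stack(observations)
p, n = X.shape

# Row j, column i of X holds x_ij, the j-th measurement on the i-th subject.
second_observation = X[:, 1]    # all p measurements on subject 2
first_variable = X[0, :]        # measurement 1 across all n subjects
```

Many software packages store the transpose instead (one row per observation), so it is worth checking the orientation before applying any formula from these notes.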

Data Exploration - Visualization

1D example – Hidalgo Stamp Data

• n = 485 observed values of thickness for Mexico stamps

• over > 70 years

• During the 1800s (the 1872 Hidalgo issue)

• Stamp papers produced in several factories?

• No records. Can we guess by looking at the data?

Source: Izenman and Sommer (1988); here we use the data “stamp” in the R package BSDA.


Data Exploration - 1D Hidalgo Stamp

Histograms with different bin widths. How many factories?

[Figure: four histograms of thickness (horizontal axis: thickness, 0.06 to 0.12; vertical axis: frequency), with bin widths h = 0.015, h = 0.010, h = 0.005 and h = 0.002.]

Data Exploration - 1D Hidalgo Stamp

Histograms with different bin locations. (Fixed bin width 0.012)

[Figure: four histograms of thickness (horizontal axis: thickness, 0.06 to 0.12; vertical axis: frequency), each with a different bin location.]
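The dependence of a histogram on bin width and bin location can be explored with a short script. This is only a sketch in Python with NumPy: the data are a synthetic stand-in (a two-component normal mixture chosen for illustration), not the stamp measurements, and `histogram_counts` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the thickness values (NOT the stamp data):
# a two-component normal mixture, an assumption made for illustration.
x = np.concatenate([rng.normal(0.08, 0.005, 300), rng.normal(0.10, 0.005, 185)])

def histogram_counts(data, width, origin):
    """Counts for bins [origin + k*width, origin + (k+1)*width) covering the data."""
    # First bin edge at or below the smallest value, aligned with the origin.
    first = origin + np.floor((data.min() - origin) / width) * width
    n_bins = int(np.floor((data.max() - first) / width)) + 1
    edges = first + width * np.arange(n_bins + 1)
    counts, _ = np.histogram(data, bins=edges)
    return counts, edges

# Same origin, different bin widths.
for h in (0.015, 0.010, 0.005, 0.002):
    counts, _ = histogram_counts(x, h, origin=0.0)
    print(f"h = {h}: {len(counts)} bins, largest count {counts.max()}")

# Same bin width (0.012), different bin locations.
for origin in (0.0, 0.003, 0.006, 0.009):
    counts, _ = histogram_counts(x, 0.012, origin)
    print(f"origin = {origin}: counts {counts.tolist()}")
```

Comparing the printed counts across origins shows how the apparent number of peaks can change even though the data and the bin width stay fixed.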

Data Exploration - Kernel density estimate

• Histograms depend on bin width and bin location.

• A smaller bin width is preferred for exploration.

• Different bin locations can obscure important underlying patterns. One solution is to average out the effect of bin location (an “averaged histogram”).

• A more elegant solution is the kernel density estimate.

• For $X_1, \ldots, X_n \sim$ i.i.d. $f(x)$ (a continuous pdf), a kernel density estimator of $f$ is obtained as

\[
f_h(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right),
\]

where the kernel $K(\cdot)$ is a function satisfying $\int K(x)\,dx = 1$.
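The averaged-histogram idea mentioned above can be sketched as follows. This is one possible reading of it, assuming the average is taken over histograms whose bin origins are equally spaced within one bin width; the function names and the small sample are made up for illustration.

```python
import math

def histogram_density(x, data, width, origin):
    """Histogram density estimate at x for bins [origin + k*width, origin + (k+1)*width)."""
    k = math.floor((x - origin) / width)            # index of the bin containing x
    lo, hi = origin + k * width, origin + (k + 1) * width
    count = sum(1 for v in data if lo <= v < hi)    # observations sharing that bin
    return count / (len(data) * width)

def averaged_histogram_density(x, data, width, n_shifts):
    """Average of n_shifts histogram estimates with origins shifted by width / n_shifts."""
    shifts = [j * width / n_shifts for j in range(n_shifts)]
    return sum(histogram_density(x, data, width, s) for s in shifts) / n_shifts

# Small made-up sample, used only to exercise the functions.
sample = [3, 5, 9, 11, 12, 14]
estimate = averaged_histogram_density(10, sample, width=4, n_shifts=4)
```

As the number of shifts grows, the averaged estimate no longer depends on a single choice of bin location, which is the motivation for moving on to kernel density estimates.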

Kernel density estimate - illustration

• Toy data: $x_i$ = 3, 5, 9, 11, 12, 14.

• Consider a kernel density estimate with Gaussian kernel

\[
K(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2},
\]

with bandwidth $h = 1$.

• For each $i$,

\[
\frac{1}{h} K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{\sqrt{2\pi h^2}} e^{-\frac{(x - x_i)^2}{2h^2}}.
\]


[Figure: toy data; horizontal axis: observations (0 to 16), vertical axis: density (−0.1 to 0.8).]

• Toy data: $x_i$ = 3, 5, 9, 11, 12, 14.


[Figure: toy data with one kernel overlaid; horizontal axis: observations (0 to 16), vertical axis: density (−0.1 to 0.8).]

• Overlaid, for $h = 1$, is

\[
\frac{1}{h} K\!\left(\frac{x - x_1}{h}\right) = \frac{1}{\sqrt{2\pi h^2}} e^{-\frac{(x - x_1)^2}{2h^2}}.
\]


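[Figure: toy data with one kernel per observation overlaid; horizontal axis: observations (0 to 16), vertical axis: density (−0.1 to 0.8).]

• For all $i = 1, \ldots, 6$,

\[
\frac{1}{h} K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{\sqrt{2\pi h^2}} e^{-\frac{(x - x_i)^2}{2h^2}}.
\]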


[Figure: toy data with the kernel density estimate overlaid; horizontal axis: observations (0 to 16), vertical axis: density (−0.1 to 0.8).]

• Kernel Density Estimate (KDE) of the pdf:

\[
f_h(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right).
\]
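The construction illustrated on the toy data can be reproduced in a few lines. The following is a minimal sketch in plain Python, using the Gaussian kernel and bandwidth h = 1 from the slides; the function names and the evaluation grid are choices made for this example.

```python
import math

def gaussian_kernel(t):
    """Standard normal density K(t) = exp(-t^2 / 2) / sqrt(2*pi)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Kernel density estimate f_h(x) = (1/n) * sum_i (1/h) * K((x - X_i) / h)."""
    return sum(gaussian_kernel((x - xi) / h) / h for xi in data) / len(data)

toy = [3, 5, 9, 11, 12, 14]
h = 1.0

# Evaluate the estimate on a grid from -5 to 22 in steps of 0.01.
step = 0.01
grid = [-5 + step * k for k in range(2701)]
values = [kde(x, toy, h) for x in grid]

# Riemann sum: the estimate should integrate to approximately 1.
area = sum(values) * step
```

Because each term (1/h) K((x − x_i)/h) integrates to 1 and the estimate averages n such terms, the area under the estimate is 1 up to numerical error. Evaluating `kde` near 11.5, where observations cluster, gives a larger value than near 7, where there is a gap in the data.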

KDE for Hidalgo Stamp

[Figure: four kernel density estimates of stamp thickness (vertical axis: density), each with a different bandwidth. Two panels carry the labels “N = 485 Bandwidth = 0.01” and “N = 485 Bandwidth = 0.0025”; the labels of the other two panels are not recoverable.]

KDE for Hidalgo Stamp

• A small bandwidth undersmooths; a large bandwidth oversmooths.

• Is it unimodal? Bimodal? Or are there several modes? Which modes are really there?

• Choice of bandwidth
    • Important in practice.
    • Controversial issue.
    • Many recommendations (Silverman’s rule of thumb, Sheather-Jones plug-in).
    • The consensus is that there is never a consensus.

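KDE for Hidalgo Stamp

The bandwidth given by Silverman’s rule of thumb is (sample size)$^{-1/5}$ times 90% of the minimum of two estimates of the population standard deviation: i) the sample standard deviation $s$, and ii) IQR/1.34. That is,

\[
h = 0.9 \, \min\!\left(s, \frac{\mathrm{IQR}}{1.34}\right) n^{-1/5}.
\]

[Figure: KDE of stamp thickness, N = 485, Bandwidth = 0.00391; horizontal axis: 0.06 to 0.14, vertical axis: density.]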
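Silverman’s rule of thumb as stated above is easy to compute directly. The sketch below uses only the Python standard library; `silverman_bandwidth` is a hypothetical helper, and the quartiles follow the default convention of `statistics.quantiles`, which is an assumption of this example (other software may use a different quartile definition and therefore return a slightly different value).

```python
import statistics

def silverman_bandwidth(data):
    """Rule-of-thumb bandwidth: 0.9 * min(s, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    s = statistics.stdev(data)                     # sample standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)    # quartiles (default method)
    iqr = q3 - q1                                  # interquartile range
    return 0.9 * min(s, iqr / 1.34) * n ** (-1 / 5)

# Small made-up sample, used only to exercise the function.
toy = [3, 5, 9, 11, 12, 14]
h = silverman_bandwidth(toy)
```

Dividing the IQR by 1.34 turns it into an estimate of the standard deviation under normality, and taking the minimum of the two estimates keeps the bandwidth from being inflated by heavy tails or outliers.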

Data Exploration - Multivariate data

Dimension p = 6 example – Swiss bank notes

• n = 200 Swiss bank notes (see Fig. 1.1, Härdle and Simar)

• Each note (observation) has p = 6 measurements (variables).

• Additional information: the first half are genuine; the other half are counterfeit.

• Visualization of 6-dimensional data?

• Can use 6 KDEs, each overlaid with a jitter plot, one for each measurement (variable).

Swiss bank notes - Marginal KDEs

Marginal KDEs overlaid with jitter plots for each of the 6 variables.

• Informative, and realistic when p is small.

• No information about association between variables.

[Figure: six panels, one per variable x1 to x6, each showing a marginal KDE (vertical axis: kde) with a jitter plot.]

Swiss bank notes - Scatterplot

• Pairs of variables are best visualized by a scatterplot, e.g. X4 vs X6 below.

• Understood as point clouds (representing the empirical distributions).

• $\binom{6}{2} = 15$ pairs to choose from.

[Figure: scatterplot of X4 (about 7 to 13) and X6 (about 137.5 to 142.5). Swiss bank notes: blue o = genuine, green x = fake.]

Swiss bank notes - Scatterplot

• Scatterplots of three variables can also be informative, if the software allows rotating the axes.

• Otherwise, the 3D scatterplot is just a 2D scatterplot of linear combinations of the three variables (a fixed projection).

[Figure: 3D scatterplot of X4, X6 and X1. Swiss bank notes: blue o = genuine, green x = fake.]

Swiss bank notes - scatterplot matrix

• A traditional, yet powerful, tool is to construct a matrix of scatterplots. It is too busy with p = 6.

[Figure: scatterplot matrix of X1 to X6: one marginal density panel per variable and a scatterplot for each pair of variables.]

Swiss bank notes - scatterplot matrix

• Better to visualize with principal component scores.

[Figure: scatterplot matrix of the six principal component scores (Direction 1 to Direction 6): one marginal density panel per direction and a scatterplot for each pair of directions.]

Swiss bank notes - scatterplot matrix

• With principal component scores, we can focus on fewer combinations: Dimension Reduction.

[Figure: scatterplot matrix of the first three principal component scores (Direction 1 to Direction 3): one marginal density panel per direction and a scatterplot for each pair of directions.]
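Principal component scores like those plotted above can be computed from the singular value decomposition of the centered data matrix. The following is a minimal sketch with NumPy, using the p × n layout from earlier in these notes. The data are synthetic (not the bank note measurements), and nothing here is claimed to be the exact procedure behind the figures.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 6, 200
# Synthetic p x n data (NOT the bank note measurements): correlated Gaussian columns.
A = rng.normal(size=(p, p))
X = A @ rng.normal(size=(p, n))

# Center each variable (row) at its sample mean.
Xc = X - X.mean(axis=1, keepdims=True)

# SVD of the centered data: the columns of U are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal component scores: projections of each observation onto the directions.
scores = U.T @ Xc                 # p x n; row k holds the scores for direction k + 1
variances = s**2 / (n - 1)        # sample variance of each score, in decreasing order

# Dimension reduction: keep only the first three scores for plotting.
scores_3 = scores[:3, :]
```

The scores are uncorrelated and ordered by decreasing variance, which is why a scatterplot matrix of the first few scores can stand in for the full 6 × 6 display.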

Swiss bank notes - related methods

1 If obtaining a succinct representation of the data is of interest, then using the first two principal component scores, which appear in the first 2 × 2 block of the scatterplot matrix, would do the job. (Principal Component Analysis)

2 Parametric modeling: Are the distributions of the Swiss bank note measurements normal (Gaussian)? (Multivariate Normal Distribution)

3 Without the information on genuine and counterfeit notes, can we classify the n = 200 notes into two distinct groups? (Clustering)

4 With the information on genuine and counterfeit notes,
    • Are the means and covariances of genuine and counterfeit notes different? (Statistical inference)
    • When there is a new bank note, how do we predict whether the new note is genuine? (Classification)

Modern challenges

High dimensional data

1 Gene expression data in Golub et al. (1999)

2 p = 7129 gene expression levels (numeric) for n1 = 47 subjects with acute lymphoblastic leukemia (ALL) and n2 = 25 subjects with acute myeloid leukemia (AML).

3 The scientific task is to identify new cancer classes and/or to assign tumors to known classes (ALL or AML).

Golub data

Taking a subgroup of the data with 27 ALLs and 11 AMLs. A visualization of the matrix $X_{p \times n}$. Is it helpful?

[Figure: “Processed gene expressions (Golub)”, an image of the data matrix; horizontal axis: subject index (1−27 = ALL, 28−37 = AML), vertical axis: genes (1000 to 7000), with a further scale marked 10 to 60.]

Golub data

Scatterplot matrix for the first three genes (variables).

[Figure: scatterplot matrix of X1, X2 and X3, with ALL and AML shown as separate groups: one marginal density panel per variable and a scatterplot for each pair of variables.]

Golub data

Scatterplot matrix using the first three Principal Component Scores. A pattern there?

[Figure: scatterplot matrix of the PC1, PC2 and PC3 direction scores, with ALL and AML shown as separate groups: one marginal density panel per direction and a scatterplot for each pair of directions.]

Golub data

Scatterplot matrix using a linear combination of all variables (one that leads to a good separation of the two groups) and two Principal Component Scores. Better pattern?

[Figure: scatterplot matrix of the discriminant direction (“Discriminant Dir.”), PC1 direction and PC2 direction scores, with ALL and AML shown as separate groups: one marginal density panel per direction and a scatterplot for each pair of directions.]
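The slides do not say which linear combination was used as the discriminant direction. As a stand-in, the sketch below uses the normalized difference of the two group means, one of the simplest choices, on synthetic high-dimensional data (not the Golub data). The group sizes match the subgroup above; the dimension and the mean shift are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 500, 27, 11
# Synthetic high-dimensional data (NOT the Golub data): two groups whose means
# differ in the first 20 variables only.
shift = np.zeros(p)
shift[:20] = 2.0
X1 = rng.normal(size=(p, n1))                     # group 1, columns are subjects
X2 = rng.normal(size=(p, n2)) + shift[:, None]    # group 2, columns are subjects

# Mean-difference direction, normalized to unit length.
w = X2.mean(axis=1) - X1.mean(axis=1)
w /= np.linalg.norm(w)

# One score per subject: a linear combination of all p variables.
z1 = w @ X1
z2 = w @ X2
```

Plotting `z1` and `z2` against principal component scores gives a display of the same kind as the figure above. With p much larger than n, a direction estimated from the same subjects it is evaluated on tends to look more separating than it would on new data.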

Next: review of matrix algebra.
