advanced applied multivariate analysis - stat.pitt.edu · what is multivariate analysis? 1 first...

30
Advanced Applied Multivariate Analysis STAT 2221, Spring 2015 Sungkyu Jung Department of Statistics University of Pittsburgh E-mail: [email protected] http://www.stat.pitt.edu/sungkyu/ 1 / 30

Upload: others

Post on 20-Oct-2019

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Advanced Applied Multivariate AnalysisSTAT 2221, Spring 2015

Sungkyu JungDepartment of Statistics

University of Pittsburgh

E-mail: [email protected]://www.stat.pitt.edu/sungkyu/

1 / 30

Page 2: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

General Information

• Course Webpage:http://www.stat.pitt.edu/sungkyu/course/2221Spring15/

• Prerequisite: None (officially), but• probability and inference theory• linear algebra• R, SAS or Matlab programing.

2 / 30

Page 3: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

What is multivariate analysis?

1 First course of statistics: numbers–random variables

2 Second course of statistics: vectors of numbers–randomvectors

• Basis for analysis of more complex objects, e.g. functions,matrices, tensors, images, networks.

3 Data Exploration: visualization of relationships betweenobservations.

4 Discovering and modeling patterns from dataset:Visualization, Clustering, Multivariate distributions.

5 Confirming patterns: Inference.

6 Dimension Reduction: PCA, CCA, SVD.

7 Predictions: Regression, Classification.

3 / 30

Page 4: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

What is a multivariate dataset?

Multivariate statistical analysis concerns multivariate data whereeach observation consisting of many measurements on the samesubject. We suppose the dataset X = {X1, . . . ,Xn} has nobservations (Here, n is called the sample size), and eachobservation Xi = (xi1, . . . , xip) is a vector in Rp (Here, p is calledthe dimension). These are often recorded in a p × n matrix:

X =(X1, · · · ,Xn

)=

x11 · · · xn1...

. . ....

x1p · · · xnp

4 / 30

Page 5: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Data Exploration - Visualization

1D example – Hidalgo Stamp Data

• n = 485 observed values of thickness for Mexico stamps

• over > 70 years

• During 1980s

• Stamp papers produced in several factories?

• No records. Can we guess by looking at the data?

Izenman and Sommer (1988), here we use data “stamp” in Rpackage BSDA.

5 / 30

Page 6: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Data Exploration - 1D Hidalgo StampHistograms with different bin widths. How many factories?

Histogram of thickness

h = 0.015

Fre

quen

cy

0.06 0.08 0.10 0.12

050

100

150

200

Histogram of thickness

h = 0.010

Fre

quen

cy

0.06 0.08 0.10 0.12

050

100

200

Histogram of thickness

h = 0.005

Fre

quen

cy

0.06 0.08 0.10 0.12

040

8012

0

Histogram of thickness

h = 0.002

Fre

quen

cy

0.06 0.08 0.10 0.12

020

4060

80

6 / 30

Page 7: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Data Exploration - 1D Hidalgo StampHistograms with different bin locations. (Fixed bin width 0.012)

Histogram of thickness

thickness

Fre

quen

cy

0.06 0.08 0.10 0.12

050

100

200

Histogram of thickness

thickness

Fre

quen

cy

0.06 0.08 0.10 0.12

050

150

250

Histogram of thickness

thickness

Fre

quen

cy

0.06 0.08 0.10 0.12

050

150

250

Histogram of thickness

thickness

Fre

quen

cy

0.06 0.08 0.10 0.12

050

100

150

7 / 30

Page 8: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Data Exploration - Kernel density estimate

• Histograms are dependent on binwidth and bin location.

• Smaller binwidth preferred for exploration.

• Different bin locations can obscure important underlyingpatterns; A solution is to average out the effect on differentbin location “Averaged histogram”

• A more elegant solution is kernel density estimate.

• For X1, . . . ,Xn ∼ i.i.d. f (x) (continuous pdf), a kernel densityestimator of f is obtained as

fh(x) =1

n

n∑i=1

1

hK

(x − Xi

h

),

where the kernel K (·) is a function satisfying∫K (x)dx = 1.

8 / 30

Page 9: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Kernel density estimate - illustration

• Toy data: xi = 3, 5, 9, 11, 12, 14.

• Consider a kernel density estimate with Gaussian kernel

K (t) =1√2π

e−t2/2,

with bandwidth h = 1.

• For each i ,

1

hK

(x − xi

h

)=

1√2πh2

e−(x−xi )

2

2h2 .

9 / 30

Page 10: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Kernel density estimate - illustration

0 2 4 6 8 10 12 14 16−0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

observations

dens

ity

toy data

• Toy data: xi = 3, 5, 9, 11, 12, 14.

10 / 30

Page 11: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Kernel density estimate - illustration

0 2 4 6 8 10 12 14 16−0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

observations

dens

ity

toy data

• overlaid is, for h = 1,

1

hK

(x − x1

h

)=

1√2πh2

e−(x−x1)

2

2h2 .

11 / 30

Page 12: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Kernel density estimate - illustration

0 2 4 6 8 10 12 14 16−0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

observations

dens

ity

toy data

• For all i = 1, . . . , 6,

1

hK

(x − xi

h

)=

1√2πh2

e−(x−xi )

2

2h2 .

12 / 30

Page 13: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Kernel density estimate - illustration

0 2 4 6 8 10 12 14 16−0.1

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

observations

dens

ity

toy data

• Kernel Density Estimate (KDE) of the pdf:

fh(x) =1

n

n∑i=1

1

hK

(x − Xi

h

),

13 / 30

Page 14: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

KDE for Hidalgo Stamp

0.04 0.08 0.12 0.16

05

1015

2025

KDE

N = 485 Bandwidth = 0.01

Den

sity

0.06 0.10 0.14

010

2030

KDE

N = 485 Bandwidth = 0.005

Den

sity

0.06 0.08 0.10 0.12 0.14

010

2030

40

KDE

N = 485 Bandwidth = 0.0025

Den

sity

0.06 0.08 0.10 0.12

020

4060

KDE

N = 485 Bandwidth = 0.001

Den

sity

14 / 30

Page 15: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

KDE for Hidalgo Stamp

• Small bandwidth leads undersmooth; Large bandwidth leadsoversmooth

• Is it unimodal? Bi-modal? Or several modes? Which modesare really there?

• Choice of bandwidth• Important in practice.• Controversial issue.• Many recommendations (Silverman’s rule of thumb,

Sheather-Jones Plug-In.)• The consensus is that there is never a consensus.

15 / 30

Page 16: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

KDE for Hidalgo StampBandwidth, by Silverman’s rule-of-thumb, is (sample size)−1/5×90% of minimum of two population standard deviation estimates:i) the sample standard deviation, ii) IQR/1.34

0.06 0.08 0.10 0.12 0.14

010

2030

40

KDE

N = 485 Bandwidth = 0.00391

Den

sity

16 / 30

Page 17: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Data Exploration - Multivariate data

Dimension p = 6 example – Swiss bank notes

• n = 200 Swiss bank notes (See Fig. 1.1, Hardle and Simar)

• Each note (obs.) has p = 6 measurements (variables).

• Additional information: first half are genuine; the other halfare counterfeit.

• Visualization of 6-dim’l data?

• Can use 6 KDEs overlaided with jitterplot for eachmeasurements (variables)

17 / 30

Page 18: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Swiss bank notes - Marginal KDEsMarginal KDEs overlaided with jitterplot for each of 6 variables.• Informative, realistic when p is small• No information about association between variables.

212 214 216 2180

0.5

1

1.5

2

2.5

x1

kde

129 130 1310

0.5

1

1.5

2

x2

kde

129 130 131 1320

0.5

1

1.5

2

x3

kde

5 10 150

0.1

0.2

0.3

0.4

0.5

0.6

0.7

x4

kde

5 10 150

0.2

0.4

0.6

0.8

1

x5

kde

135 140 1450

0.2

0.4

0.6

0.8

x6

kde

18 / 30

Page 19: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Swiss bank notes - Scatterplot• Pairs of variables are best visualized by scatterplot, e.g. X4 vsX6 below.

• Understood as point clouds (representing the empiricaldistributions)

•(

62

)many pairs to choose from.

7 8 9 10 11 12 13137.5

138

138.5

139

139.5

140

140.5

141

141.5

142

142.5

X4

X6

Swiss bank notes: blue o = genuine, green x = fake

19 / 30

Page 20: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Swiss bank notes - Scatterplot• Scatters of three variables can also be informative, if software

allows to rotate the axes.• Otherwise, the 3D scatterplot is a scatterplot of linear

combinations of the three variables.

78

910

1112

13

137

138

139

140

141

142

143213.5

214

214.5

215

215.5

216

216.5

X4

Swiss bank notes: blue o = genuine, green x = fake

X6

X1

20 / 30

Page 21: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Swiss bank notes - scatterplot matrix• A traditional, yet powerful, tool is to construct a matrix of

scatterplots. - Too busy with p = 6.

−1 0 10

0.51

X1

−1−0.500.50

0.5

X2

−1 0 10

0.5

X3

−2 0 20

0.2

X4

−3−2−1010

0.20.4

X5

−2 0 20

0.20.4

X6

−1−0.500.5−1

01

X2

X1

−1 0 1−1

01

X3

X1

−2 0 2−1

01

X4

X1

−3−2−101−1

01

X5

X1

−2 0 2−1

01

X6

X1

−1 0 1−1−0.500.5

X1

X2

−1 0 1−1−0.500.5

X3

X2

−2 0 2−1−0.500.5

X4

X2

−3−2−101−1−0.500.5

X5

X2

−2 0 2−1−0.500.5

X6

X2

−1 0 1−1

01

X1

X3

−1−0.500.5−1

01

X2

X3

−2 0 2−1

01

X4

X3

−3−2−101−1

01

X5

X3

−2 0 2−1

01

X6

X3

−1 0 1−2

02

X1

X4

−1−0.500.5−2

02

X2

X4

−1 0 1−2

02

X3

X4

−3−2−101−2

02

X5

X4

−2 0 2−2

02

X6

X4

−1 0 1−3−2−10

1

X1

X5

−1−0.500.5−3−2−10

1

X2

X5

−1 0 1−3−2−10

1

X3

X5

−2 0 2−3−2−10

1

X4

X5

−2 0 2−3−2−10

1

X6

X5

−1 0 1−2

02

X1

X6

−1−0.500.5−2

02

X2

X6

−1 0 1−2

02

X3

X6

−2 0 2−2

02

X4

X6

−3−2−101−2

02

X5

X6

21 / 30

Page 22: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Swiss bank notes - scatterplot matrix• Better to visualize with principal component scores.

−2 0 20

0.2

Direction 1

−2 0 20

0.20.4

Direction 2

−1 0 10

0.5

Direction 3

−1 0 10

0.51

Direction 4

−1−0.500.50

1

Direction 5

−0.5 0 0.5012

Direction 6

−2 0 2−2

02

Direction 2

Dire

ctio

n 1

−1 0 1−2

02

Direction 3

Dire

ctio

n 1

−1 0 1−2

02

Direction 4

Dire

ctio

n 1

−1−0.500.5−2

02

Direction 5

Dire

ctio

n 1

−0.5 0 0.5−2

02

Direction 6

Dire

ctio

n 1

−2 0 2

−202

Direction 1

Dire

ctio

n 2

−1 0 1

−202

Direction 3

Dire

ctio

n 2

−1 0 1

−202

Direction 4

Dire

ctio

n 2

−1−0.500.5

−202

Direction 5

Dire

ctio

n 2

−0.5 0 0.5

−202

Direction 6

Dire

ctio

n 2

−2 0 2−1

01

Direction 1

Dire

ctio

n 3

−2 0 2−1

01

Direction 2

Dire

ctio

n 3

−1 0 1−1

01

Direction 4

Dire

ctio

n 3

−1−0.500.5−1

01

Direction 5

Dire

ctio

n 3

−0.5 0 0.5−1

01

Direction 6

Dire

ctio

n 3

−2 0 2−1

01

Direction 1

Dire

ctio

n 4

−2 0 2−1

01

Direction 2

Dire

ctio

n 4

−1 0 1−1

01

Direction 3

Dire

ctio

n 4

−1−0.500.5−1

01

Direction 5

Dire

ctio

n 4

−0.5 0 0.5−1

01

Direction 6

Dire

ctio

n 4

−2 0 2−1

−0.50

0.5

Direction 1

Dire

ctio

n 5

−2 0 2−1

−0.50

0.5

Direction 2

Dire

ctio

n 5

−1 0 1−1

−0.50

0.5

Direction 3

Dire

ctio

n 5

−1 0 1−1

−0.50

0.5

Direction 4

Dire

ctio

n 5

−0.5 0 0.5−1

−0.50

0.5

Direction 6

Dire

ctio

n 5

−2 0 2−0.5

00.5

Direction 1

Dire

ctio

n 6

−2 0 2−0.5

00.5

Direction 2

Dire

ctio

n 6

−1 0 1−0.5

00.5

Direction 3

Dire

ctio

n 6

−1 0 1−0.5

00.5

Direction 4

Dire

ctio

n 6

−1−0.500.5−0.5

00.5

Direction 5

Dire

ctio

n 6

22 / 30

Page 23: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Swiss bank notes - scatterplot matrix• With principal component scores, we can focus on fewer

combinations; Dimension Reduction

−2 0 20

0.1

0.2

0.3

Direction 1

−2 0 20

0.1

0.2

0.3

0.4

Direction 2

−1 0 10

0.2

0.4

0.6

0.8

Direction 3

−2 0 2

−2

0

2

Direction 2

Dire

ctio

n 1

−1 0 1

−2

0

2

Direction 3

Dire

ctio

n 1

−2 0 2

−2

0

2

Direction 1

Dire

ctio

n 2

−1 0 1

−2

0

2

Direction 3

Dire

ctio

n 2

−2 0 2

−1

0

1

Direction 1

Dire

ctio

n 3

−2 0 2

−1

0

1

Direction 2

Dire

ctio

n 3

23 / 30

Page 24: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Swiss bank notes - related methods

1 If obtaining succint representation of data is of interest, thenusing the first two principal component scores appeared in thefirst 2× 2 block of the scatterplot matrix would do the job(Principal Component Analysis)

2 Parametric modeling: Are distributions of Swiss bank notemeasurements normal (Gaussian)? (Multivariate NormalDistribution)

3 Without the information on genuine and counterfeit notes,can we classify n = 200 notes into two distinct groups?(Clustering)

4 With the information on genuine and counterfeit notes,• Are the means and covariances of genuine and counterfeit

notes different? (Statistical inference)• When there is a new bank note, how to predict whether the

new note is genuine? (Classification)

24 / 30

Page 25: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Modern challenges

High dimensional data

1 Gene expression data in Golub et al. (1999)

2 p = 7129 gene expression levels (numeric) for n1 = 47subjects with acute lymphoblastic leukemia (ALL) andn2 = 25 subjects with acute myeloid leukemia (AML).

3 Scientific task is to identify new cancer class and/or to assigntumors to known classes (ALL or AML).

25 / 30

Page 26: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Golub dataTaking a subgroup of data with 27 ALLs and 11 AMLs.A visualization of the matrix Xp×n. Is it helpful?

Processed gene expressions (Golub)

subject index: 1−27 = ALL, 28−37 = AML

gene

s

5 10 15 20 25 30 35

1000

2000

3000

4000

5000

6000

7000

10

20

30

40

50

60

26 / 30

Page 27: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Golub dataScatterplot matrix for the first three genes (variables).

−300−200−100 0 1000

2

4

x 10−3

ALL

AML

X1

−100 0 1000

2

4

x 10−3

X2

−200 0 2000

1

2

3

4x 10

−3

X3

−100 0 100

−300

−200

−100

0

100

X2

X1

−200 0 200

−300

−200

−100

0

100

X3

X1

−300−200−100 0 100

−100

0

100

X1

X2

−200 0 200

−100

0

100

X3

X2

−300−200−100 0 100

−200

0

200

X1

X3

−100 0 100

−200

0

200

X2

X3

27 / 30

Page 28: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Golub dataScatterplot matrix using the first three Principal ComponentScores. A pattern there?

−5 0 5

x 104

0

0.5

1

1.5x 10

−5

ALL

AML

PC1 Direction

−5 0 5

x 104

0

0.5

1

1.5x 10

−5

PC2 Direction

−4 −2 0 2 4 6

x 104

0

0.5

1

1.5

2

x 10−5

PC3 Direction

−5 0 5

x 104

−5

0

5

x 104

PC2 Direction

PC

1 D

irect

ion

−4 −2 0 2 4 6

x 104

−5

0

5

x 104

PC3 Direction

PC

1 D

irect

ion

−5 0 5

x 104

−5

0

5

x 104

PC1 Direction

PC

2 D

irect

ion

−4 −2 0 2 4 6

x 104

−5

0

5

x 104

PC3 Direction

PC

2 D

irect

ion

−5 0 5

x 104

−4

−2

0

2

4

6x 10

4

PC1 Direction

PC

3 D

irect

ion

−5 0 5

x 104

−4

−2

0

2

4

6x 10

4

PC2 Direction

PC

3 D

irect

ion

28 / 30

Page 29: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Golub dataScatterplot matrix using the a liniear combination of all variables(that leads a good separation of two groups) and two PrincipalComponent Scores. Better pattern?

−1 0 1 2

x 104

0

1

2

3x 10

−4

ALL

AML

Discriminant Dir.

−5 0 5

x 104

0

0.5

1

1.5x 10

−5

PC1 Direction

−5 0 5

x 104

0

0.5

1

1.5x 10

−5

PC2 Direction

−5 0 5

x 104

−2

0

2

4x 10

4

PC1 Direction

Dis

crim

inan

t Dir.

−5 0 5

x 104

−2

−1

0

1

2

x 104

PC2 Direction

Dis

crim

inan

t Dir.

−1 0 1 2

x 104

−5

0

5

x 104

Discriminant Dir.

PC

1 D

irect

ion

−5 0 5

x 104

−5

0

5

x 104

PC2 Direction

PC

1 D

irect

ion

−1 0 1 2

x 104

−5

0

5

x 104

Discriminant Dir.

PC

2 D

irect

ion

−5 0 5

x 104

−5

0

5

x 104

PC1 Direction

PC

2 D

irect

ion

29 / 30

Page 30: Advanced Applied Multivariate Analysis - stat.pitt.edu · What is multivariate analysis? 1 First course of statistics: numbers{random variables 2 Second course of statistics: vectors

Next: review on matrix algebra.

30 / 30