CpSc 881: Machine Learning PCA and MDS

Page 1: CpSc 881: Machine Learning, PCA and MDS

Page 2: Copyright Notice

Most slides in this presentation are adapted from slides of the textbook and various other sources. The copyright belongs to the original authors. Thanks!

Page 3: Background: Covariance

X = Temperature    Y = Humidity
40                 90
40                 90
40                 90
30                 90
15                 70
15                 70
15                 70
30                 90
15                 70
30                 70
30                 70
30                 90
40                 70
30                 90

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}

Covariance measures the correlation between X and Y:
- cov(X, Y) = 0: X and Y are uncorrelated (no linear relationship)
- cov(X, Y) > 0: X and Y tend to move in the same direction
- cov(X, Y) < 0: X and Y tend to move in opposite directions
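A minimal NumPy sketch (not part of the original slides) that computes cov(X, Y) for the temperature/humidity table above, both directly from the formula and with np.cov:

```python
import numpy as np

# Temperature / humidity data from the table above
X = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30], dtype=float)
Y = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90], dtype=float)

# Covariance straight from the formula, with n - 1 in the denominator
n = len(X)
cov_xy = ((X - X.mean()) * (Y - Y.mean())).sum() / (n - 1)

# np.cov returns the full 2 x 2 covariance matrix; entry [0, 1] is cov(X, Y)
assert np.isclose(cov_xy, np.cov(X, Y)[0, 1])
print(cov_xy)   # positive: temperature and humidity move in the same direction here
```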

Page 4: Background: Covariance Matrix

Contains covariance values between all possible dimensions (=attributes):

Example for three attributes (x,y,z):

C = \begin{pmatrix}
cov(x,x) & cov(x,y) & cov(x,z) \\
cov(y,x) & cov(y,y) & cov(y,z) \\
cov(z,x) & cov(z,y) & cov(z,z)
\end{pmatrix}

In general, C_{n \times n} = (c_{ij}) with c_{ij} = cov(Dim_i, Dim_j).

Page 5: Background: Eigenvalues and Eigenvectors

Eigenvectors e and eigenvalues \lambda satisfy C e = \lambda e.

How to calculate e and \lambda: compute det(C - \lambda I), which yields a polynomial of degree n. The roots of det(C - \lambda I) = 0 are the eigenvalues; substituting each eigenvalue back into (C - \lambda I) e = 0 gives the corresponding eigenvector.

See any linear algebra text, such as Elementary Linear Algebra by Howard Anton (John Wiley & Sons), or any math package such as MATLAB.
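A small NumPy check of this defining property; the 2 x 2 matrix here is an arbitrary symmetric example, not one from the slides:

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])     # arbitrary symmetric (covariance-like) matrix

eigvals, eigvecs = np.linalg.eigh(C)      # eigh is the routine for symmetric matrices
for lam, e in zip(eigvals, eigvecs.T):    # columns of eigvecs are the eigenvectors
    assert np.allclose(C @ e, lam * e)    # the defining property: C e = lambda e
print(eigvals)                            # [1. 3.]
```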

Page 6: An Example

X1   X2    X1'     X2'
19   63    -5.1     9.25
39   74    14.9    20.25
30   87     5.9    33.25
30   23     5.9   -30.75
15   35    -9.1   -18.75
15   43    -9.1   -10.75
15   32    -9.1   -21.75
30   73     5.9    19.25

[Scatter plots: the raw data (X1 vs. X2) and the same data after subtracting the means. Mean1 = 24.1, Mean2 = 53.8.]

Page 7: Covariance Matrix

C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}

Using MATLAB, we find:
Eigenvector e1 = (-0.98, 0.21), with eigenvalue \lambda_1 = 51.8
Eigenvector e2 = (0.21, 0.98), with eigenvalue \lambda_2 = 560.2

Thus the second eigenvector (the one with the larger eigenvalue) is more important!

Page 8: Principal Component Analysis (PCA)

Used for visualization of complex data. Principal Component Analysis: project the data onto the subspace with the most variance.

Developed to capture as much of the variation in data as possible

Generic features of principal components

- summary variables
- linear combinations of the original variables
- uncorrelated with each other
- capture as much of the original variance as possible

Page 9: PCA Algorithm

1. X ← create the N x d data matrix, with one row vector x_n per data point

2. Subtract the mean vector \bar{x} from each row vector x_n in X

3. \Sigma ← covariance matrix of X

4. Find the eigenvectors and eigenvalues of \Sigma

5. PCs ← the M eigenvectors with the largest eigenvalues (sketched in code below)

Page 10: Principal components

1st principal component (PC1): the direction along which there is the greatest variation

2nd principal component (PC2): the direction with the maximum variation left in the data, orthogonal to the direction (i.e. vector) of PC1

3rd principal component (PC3): the direction with the maximal variation left in the data, orthogonal to the plane of PC1 and PC2 (rarely used)

etc.

Page 11: Geometric Rationale of PCA

The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) that have the following properties:

- they are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis p has the lowest variance
- the covariance among each pair of the principal axes is zero (the principal axes are uncorrelated; see the numerical check below)
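A quick numerical illustration of the second property (a sketch on synthetic data, not from the slides): after projecting onto the principal axes, the covariance matrix of the projected data is diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated 3-d data

Xc = X - X.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs                      # coordinates on the principal axes

C = np.cov(scores, rowvar=False)
off_diag = C - np.diag(np.diag(C))
assert np.allclose(off_diag, 0, atol=1e-10)   # pairwise covariances are ~0
print(np.round(np.diag(C), 3))                # the variances, one per principal axis
```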

Page 12: Example: 3 dimensions => 2 dimensions

Page 13: PCA on all Genes (Leukemia data, precursor B and T)

• Plot of 34 patients, 8973 dimensions (genes) reduced to 2

Page 14: How many components?

Check the distribution of the eigenvalues.

Take enough eigenvectors to cover 80-90% of the variance.
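A short sketch of this rule (the data here is random, only to make the snippet runnable; the 0.90 threshold is an assumption standing in for "80-90%"):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # any (N, d) data matrix

eigvals = np.linalg.eigvalsh(np.cov(X - X.mean(axis=0), rowvar=False))[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()        # cumulative fraction of variance
n_components = int(np.searchsorted(explained, 0.90)) + 1
print(np.round(explained, 3), n_components)
```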

Page 15: Problems and limitations

What if the data is very high-dimensional? E.g., images (d >= 10^4).

Problem: the covariance matrix \Sigma has d x d entries, i.e. on the order of d^2.
For d = 10^4, \Sigma has about 10^8 entries.

Solution: Singular Value Decomposition (SVD)!
- efficient algorithms are available (e.g., MATLAB)
- some implementations find just the top N eigenvectors
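A hedged sketch of this idea: compute the principal components from an SVD of the centered data matrix, so the d x d covariance matrix is never formed (sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))           # N = 100 samples, d = 10,000 features

Xc = X - X.mean(axis=0)
# Economy-size SVD of the N x d matrix: no d x d matrix is ever built
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

M = 5
components = Vt[:M]                          # top-M principal directions (rows)
eigvals = S[:M] ** 2 / (X.shape[0] - 1)      # matching covariance eigenvalues
projected = Xc @ components.T                # (N, M) reduced representation
print(eigvals.shape, projected.shape)
```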

Page 16:

Page 17: Singular Value Decomposition

Problem:
#1: Find concepts in text
#2: Reduce dimensionality

Page 18: SVD - Definition

A[n x m] = U[n x r] \Lambda[r x r] (V[m x r])^T

A: n x m matrix (e.g., n documents, m terms)

U: n x r matrix (n documents, r concepts)

\Lambda: r x r diagonal matrix (strength of each 'concept') (r: rank of the matrix)

V: m x r matrix (m terms, r concepts)

Page 19: SVD - Properties

THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U \Lambda V^T, where

U, V: unique (*)

U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other)

U^T U = I; V^T V = I (I: identity matrix)

\Lambda: the singular values are positive, and sorted in decreasing order
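A NumPy check of these properties (a sketch, using the 7 x 5 document-term matrix from the example a few slides below):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(U.shape[1]))     # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))  # columns of V are orthonormal
assert np.all(s[:-1] >= s[1:])                      # singular values sorted decreasingly
assert np.allclose(U @ np.diag(s) @ Vt, A)          # A = U Lambda V^T
print(np.round(s, 2))                               # only two are essentially non-zero
```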

Page 20: SVD - Properties

‘spectral decomposition’ of the matrix:

    1 1 1 0 0
    2 2 2 0 0
    1 1 1 0 0
A = 5 5 5 0 0   =   [ u1  u2 ]  x  diag(\lambda_1, \lambda_2)  x  [ v1^T ]
    0 0 0 2 2                                                     [ v2^T ]
    0 0 0 3 3
    0 0 0 1 1

Page 21: SVD - Interpretation

‘documents’, ‘terms’ and ‘concepts’:

U: document-to-concept similarity matrix

V: term-to-concept similarity matrix

\Lambda: its diagonal elements give the 'strength' of each concept

Projection:

best axis to project on: (‘best’ = min sum of squares of projection errors)

Page 22: SVD - Example

A = U \Lambda V^T - example. Rows of A are documents (the first four are CS documents, the last three are MD documents); columns are the terms data, inf., retrieval, brain, lung.

      1 1 1 0 0         0.18 0
      2 2 2 0 0         0.36 0
      1 1 1 0 0         0.18 0
A =   5 5 5 0 0    =    0.90 0      x   9.64 0     x   0.58 0.58 0.58 0    0
      0 0 0 2 2         0    0.53       0    5.29      0    0    0    0.71 0.71
      0 0 0 3 3         0    0.80
      0 0 0 1 1         0    0.27

Page 23: SVD - Example

Same decomposition as above. The two columns of U correspond to the CS-concept and the MD-concept: U is the document-to-concept similarity matrix.

Page 24: SVD - Example

Same decomposition as above. The diagonal entries of \Lambda give the 'strength' of each concept; 9.64 is the strength of the CS-concept.

Page 25: SVD - Example

Same decomposition as above. V (shown as V^T) is the term-to-concept similarity matrix; the first row of V^T is the CS-concept.

Page 26: SVD - Dimensionality reduction

Q: how exactly is dim. reduction done?

A: set the smallest singular values to zero:

Starting from the decomposition above, \Lambda = diag(9.64, 5.29) becomes diag(9.64, 0): only the first column of U and the first row of V^T are kept.
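A sketch of exactly this truncation step (keep the k largest singular values, zero the rest, and rebuild the matrix); the helper name rank_k_approx is my own:

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k approximation of A: zero all but the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = np.where(np.arange(len(s)) < k, s, 0.0)
    return U @ np.diag(s_trunc) @ Vt

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

print(np.round(rank_k_approx(A, 1), 2))   # compare with the rank-1 result on the next slides
```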

Page 27: SVD - Dimensionality reduction

1 1 1 0 0         0.18
2 2 2 0 0         0.36
1 1 1 0 0         0.18
5 5 5 0 0    ~    0.90    x   9.64   x   [ 0.58 0.58 0.58 0 0 ]
0 0 0 2 2         0
0 0 0 3 3         0
0 0 0 1 1         0

Page 28: SVD - Dimensionality reduction

1 1 1 0 0         1 1 1 0 0
2 2 2 0 0         2 2 2 0 0
1 1 1 0 0         1 1 1 0 0
5 5 5 0 0    ~    5 5 5 0 0
0 0 0 2 2         0 0 0 0 0
0 0 0 3 3         0 0 0 0 0
0 0 0 1 1         0 0 0 0 0

Page 29: Multidimensional Scaling Procedures

Similar in spirit to PCA, but it takes a dissimilarity matrix as input.

Page 30: Multidimensional Scaling Procedures

The purpose of multidimensional scaling (MDS) is to map the distances between points in a high dimensional space into a lower dimensional space without too much loss of information.

Page 31: Math

MDS seeks values z_1, ..., z_N in R^k that minimize the so-called stress function

S_M(z_1, \ldots, z_N) = \sum_{i \neq i'} \left( d_{ii'} - \lVert z_i - z_{i'} \rVert \right)^2

This is known as least-squares (or Kruskal-Shepard) multidimensional scaling. A gradient descent algorithm is used to minimize S_M.

A variation is Sammon's (1969) non-linear mapping, which minimizes the stress function

S_{Sm}(z_1, \ldots, z_N) = \sum_{i \neq i'} \frac{\left( d_{ii'} - \lVert z_i - z_{i'} \rVert \right)^2}{d_{ii'}}
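A hedged NumPy/SciPy sketch (not from the slides) that minimizes the least-squares stress above with a generic gradient-based optimizer; the names stress and mds are my own:

```python
import numpy as np
from scipy.optimize import minimize

def stress(z_flat, D, k):
    """Least-squares stress: sum over pairs of (d_ii' - ||z_i - z_i'||)^2."""
    Z = z_flat.reshape(-1, k)
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)
    iu = np.triu_indices_from(D, k=1)            # count each pair once
    return ((D[iu] - dist[iu]) ** 2).sum()

def mds(D, k=2, seed=0):
    """Map an N x N dissimilarity matrix D to N points in R^k by minimizing stress."""
    rng = np.random.default_rng(seed)
    z0 = rng.normal(scale=D.mean(), size=(D.shape[0], k)).ravel()
    res = minimize(stress, z0, args=(D, k), method="L-BFGS-B")
    return res.x.reshape(-1, k), res.fun
```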

Page 32:

We use MDS to visualize the dissimilarities between objects.

The procedures are very exploratory and their interpretations are as much art as they are science.

Page 33: Examples

The “points” that are represented in multidimensional space can be just about anything.

These objects might be people, in which case MDS can identify clusters of people who are “close” versus “distant” in some real or psychological sense.

Page 34: Multidimensional Scaling Procedures

As long as the “distance” between the objects can be assessed in some fashion, MDS can be used to find the lowest dimensional space that still adequately captures the distances between objects.

Once the number of dimensions is identified, a further challenge is identifying the meaning of those dimensions.

Basic data representation in MDS is a dissimilarity matrix that shows the distance between every possible pair of objects.

The goal of MDS is to faithfully represent these distances in the lowest-dimensional space possible.

Page 35: Multidimensional Scaling Procedures

The mathematics behind MDS can be daunting to understand.

Two types: classical (metric) multidimensional scaling and non-metric scaling.

Example: Distances between cities on the globe

Page 36: Multidimensional Scaling Procedures

This table lists the distances between European cities. A multidimensional scaling of these data should be able to recover the two dimensions (North-South x East-West) that we know must underlie the spatial relations among the cities.

         Athens  Berlin  Dublin  London  Madrid  Paris  Rome  Warsaw
Athens        0    1119    1777    1486    1475   1303   646    1013
Berlin     1119       0     817     577    1159    545   736     327
Dublin     1777     817       0     291     906    489  1182    1135
London     1486     577     291       0     783    213   897     904
Madrid     1475    1159     906     783       0    652   856    1483
Paris      1303     545     489     213     652      0   694     859
Rome        646     736    1182     897     856    694     0     839
Warsaw     1013     327    1135     904    1483    859   839       0
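As a usage example (a sketch that reuses the hypothetical mds() function from the "Math" slide above; dividing the distances by 1000 is only to keep the numbers small for the optimizer):

```python
import numpy as np

cities = ["Athens", "Berlin", "Dublin", "London", "Madrid", "Paris", "Rome", "Warsaw"]
D = np.array([
    [   0, 1119, 1777, 1486, 1475, 1303,  646, 1013],
    [1119,    0,  817,  577, 1159,  545,  736,  327],
    [1777,  817,    0,  291,  906,  489, 1182, 1135],
    [1486,  577,  291,    0,  783,  213,  897,  904],
    [1475, 1159,  906,  783,    0,  652,  856, 1483],
    [1303,  545,  489,  213,  652,    0,  694,  859],
    [ 646,  736, 1182,  897,  856,  694,    0,  839],
    [1013,  327, 1135,  904, 1483,  859,  839,    0],
], dtype=float)

Z, final_stress = mds(D / 1000.0, k=2)        # mds() is defined in the sketch above
for name, (x, y) in zip(cities, Z):
    print(f"{name:8s} {x:7.2f} {y:7.2f}")
```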

Page 37: Multidimensional Scaling Procedures

MDS begins by restricting the dimension of the space and then seeking an arrangement of the objects in that restricted space that minimizes the difference between the distances in that space and the actual distances.

Page 38: Multidimensional Scaling Procedures

The appropriate number of dimensions is identified...

The objects can be plotted in the multidimensional space...

Determine which objects cluster together and why they might cluster together. The latter issue concerns the meaning of the dimensions and often requires additional information.

Page 39: Multidimensional Scaling Procedures

In the cities data, the meaning is quite clear.

The dimensions refer to the North-South x East-West surface area across which the cities are dispersed.

We would expect MDS to faithfully recreate the map relations among the cities.

Page 40: Multidimensional Scaling Procedures

This arrangement provides the best fit for a one-dimensional model. How good is the fit? We use a statistic called “stress” to judge the goodness-of-fit.

[Figure: Derived stimulus configuration (Euclidean distance model): a one-dimensional plot of the eight cities along dimension 1.]

Page 41: Multidimensional Scaling Procedures

Smaller stress values indicate better fit. Some rules of thumb for degree of fit are:

Stress   Fit
 .20     Poor
 .10     Fair
 .05     Good
 .02     Excellent
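The slides do not spell out how the stress statistic is normalized; assuming it is Kruskal's stress-1 (the form these rules of thumb are usually quoted for), a sketch of its computation is:

```python
import numpy as np

def stress_1(D, Z):
    """Kruskal's stress-1 for an embedding Z (N x k) of a dissimilarity matrix D (N x N)."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices_from(D, k=1)
    return np.sqrt(((D[iu] - dist[iu]) ** 2).sum() / (dist[iu] ** 2).sum())
```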


Page 42: Multidimensional Scaling Procedures

The stress for the one-dimensional model of the cities data is .31, clearly a poor fit.

The poor fit can also be seen in a plot of the actual distances versus the distances in the one-dimensional model, known as a Shepard plot.

[Figure: Shepard plot (scatterplot of linear fit, Euclidean distance model): disparities vs. distances for the one-dimensional model.]

In a good fitting model, the points will lie along a line, sloping upward to the right, showing a one-to-one correspondence between distances in the model space and actual distances. Clearly not evident here.


Page 43: Multidimensional Scaling Procedures

[Figure: Shepard plot (disparities vs. distances) for the two-dimensional model.]

A two-dimensional model fits very well. The stress value is also quite small (.00902) indicating an exceptional fit. Of course, this is no great surprise for these data.


Page 44: Multidimensional Scaling Procedures

[Figure: Shepard plot (disparities vs. distances) for the three-dimensional model.]

There is hardly any room for a three-dimensional model to improve matters. The stress is .00918, indicating that a third dimension does not help at all.


Page 45: MDS Example: Clusters among Prostate Samples