CpSc 881: Machine Learning PCA and MDS

Page 1: CpSc 881: Machine Learning, PCA and MDS

Page 2: Copyright Notice

Most slides in this presentation are adapted from slides of the textbook and various other sources. The copyright belongs to the original authors. Thanks!

Page 3: Background: Covariance

X = Temperature    Y = Humidity
40                 90
40                 90
40                 90
30                 90
15                 70
15                 70
15                 70
30                 90
15                 70
30                 70
30                 70
30                 90
40                 70
30                 90

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}

Covariance measures the correlation between X and Y:
- cov(X, Y) = 0: X and Y are uncorrelated (no linear relationship)
- cov(X, Y) > 0: X and Y tend to move in the same direction
- cov(X, Y) < 0: X and Y tend to move in opposite directions
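A minimal NumPy sketch (not part of the original slides) that computes cov(X, Y) for the temperature/humidity table above, both directly from the formula and with np.cov:

```python
import numpy as np

# Temperature / humidity data from the table above
X = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30], dtype=float)
Y = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90], dtype=float)

# Covariance straight from the formula, with n - 1 in the denominator
n = len(X)
cov_xy = ((X - X.mean()) * (Y - Y.mean())).sum() / (n - 1)

# np.cov returns the full 2 x 2 covariance matrix; entry [0, 1] is cov(X, Y)
assert np.isclose(cov_xy, np.cov(X, Y)[0, 1])
print(cov_xy)   # positive: temperature and humidity move in the same direction here
```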

Page 4: Background: Covariance Matrix

Contains covariance values between all possible dimensions (=attributes):

Example for three attributes (x,y,z):

C = \begin{pmatrix}
cov(x,x) & cov(x,y) & cov(x,z) \\
cov(y,x) & cov(y,y) & cov(y,z) \\
cov(z,x) & cov(z,y) & cov(z,z)
\end{pmatrix}

In general, C_{n \times n} = (c_{ij}) with c_{ij} = cov(Dim_i, Dim_j).

Page 5: Background: Eigenvalues and Eigenvectors

Eigenvectors e and eigenvalues \lambda satisfy C e = \lambda e.

How to calculate e and \lambda: compute det(C - \lambda I), which yields a polynomial of degree n. The roots of det(C - \lambda I) = 0 are the eigenvalues; substituting each eigenvalue back into (C - \lambda I) e = 0 gives the corresponding eigenvector.

See any linear algebra text, such as Elementary Linear Algebra by Howard Anton (John Wiley & Sons), or any math package such as MATLAB.
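A small NumPy check of this defining property; the 2 x 2 matrix here is an arbitrary symmetric example, not one from the slides:

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])     # arbitrary symmetric (covariance-like) matrix

eigvals, eigvecs = np.linalg.eigh(C)      # eigh is the routine for symmetric matrices
for lam, e in zip(eigvals, eigvecs.T):    # columns of eigvecs are the eigenvectors
    assert np.allclose(C @ e, lam * e)    # the defining property: C e = lambda e
print(eigvals)                            # [1. 3.]
```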

Page 6: An Example

X1   X2    X1'     X2'
19   63    -5.1     9.25
39   74    14.9    20.25
30   87     5.9    33.25
30   23     5.9   -30.75
15   35    -9.1   -18.75
15   43    -9.1   -10.75
15   32    -9.1   -21.75
30   73     5.9    19.25

[Scatter plots: the raw data (X1 vs. X2) and the same data after subtracting the means. Mean1 = 24.1, Mean2 = 53.8.]

Page 7: Covariance Matrix

C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix}

Using MATLAB, we find:
Eigenvector e1 = (-0.98, 0.21), with eigenvalue \lambda_1 = 51.8
Eigenvector e2 = (0.21, 0.98), with eigenvalue \lambda_2 = 560.2

Thus the second eigenvector (the one with the larger eigenvalue) is more important!

Page 8: Principal Component Analysis (PCA)

Used for visualization of complex data. Principal Component Analysis: project the data onto the subspace with the most variance.

Developed to capture as much of the variation in data as possible

Generic features of principal components

- summary variables
- linear combinations of the original variables
- uncorrelated with each other
- capture as much of the original variance as possible

Page 9: PCA Algorithm

1. X ← create the N x d data matrix, with one row vector x_n per data point

2. Subtract the mean vector \bar{x} from each row vector x_n in X

3. \Sigma ← covariance matrix of X

4. Find the eigenvectors and eigenvalues of \Sigma

5. PCs ← the M eigenvectors with the largest eigenvalues (sketched in code below)

Page 10: Principal components

1st principal component (PC1): the direction along which there is the greatest variation

2nd principal component (PC2): the direction with the maximum variation left in the data, orthogonal to the direction (i.e. vector) of PC1

3rd principal component (PC3): the direction with the maximal variation left in the data, orthogonal to the plane of PC1 and PC2 (rarely used)

etc.

Page 11: Geometric Rationale of PCA

The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) that have the following properties:

- they are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ..., and axis p has the lowest variance
- the covariance among each pair of the principal axes is zero (the principal axes are uncorrelated; see the numerical check below)
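A quick numerical illustration of the second property (a sketch on synthetic data, not from the slides): after projecting onto the principal axes, the covariance matrix of the projected data is diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # correlated 3-d data

Xc = X - X.mean(axis=0)
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs                      # coordinates on the principal axes

C = np.cov(scores, rowvar=False)
off_diag = C - np.diag(np.diag(C))
assert np.allclose(off_diag, 0, atol=1e-10)   # pairwise covariances are ~0
print(np.round(np.diag(C), 3))                # the variances, one per principal axis
```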

Page 12: Example: 3 dimensions => 2 dimensions

Page 13: PCA on all Genes (Leukemia data, precursor B and T)

• Plot of 34 patients, 8973 dimensions (genes) reduced to 2

Page 14: How many components?

Check the distribution of the eigenvalues.

Take enough eigenvectors to cover 80-90% of the variance.
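A short sketch of this rule (the data here is random, only to make the snippet runnable; the 0.90 threshold is an assumption standing in for "80-90%"):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # any (N, d) data matrix

eigvals = np.linalg.eigvalsh(np.cov(X - X.mean(axis=0), rowvar=False))[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()        # cumulative fraction of variance
n_components = int(np.searchsorted(explained, 0.90)) + 1
print(np.round(explained, 3), n_components)
```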

Page 15: Problems and limitations

What if the data is very high-dimensional? E.g., images (d >= 10^4).

Problem: the covariance matrix \Sigma has d x d entries, i.e. on the order of d^2.
For d = 10^4, \Sigma has about 10^8 entries.

Solution: Singular Value Decomposition (SVD)!
- efficient algorithms are available (e.g., MATLAB)
- some implementations find just the top N eigenvectors
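A hedged sketch of this idea: compute the principal components from an SVD of the centered data matrix, so the d x d covariance matrix is never formed (sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))           # N = 100 samples, d = 10,000 features

Xc = X - X.mean(axis=0)
# Economy-size SVD of the N x d matrix: no d x d matrix is ever built
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

M = 5
components = Vt[:M]                          # top-M principal directions (rows)
eigvals = S[:M] ** 2 / (X.shape[0] - 1)      # matching covariance eigenvalues
projected = Xc @ components.T                # (N, M) reduced representation
print(eigvals.shape, projected.shape)
```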

Page 16:

Page 17: Singular Value Decomposition

Problem:
#1: Find concepts in text
#2: Reduce dimensionality

Page 18: SVD - Definition

A[n x m] = U[n x r] \Lambda[r x r] (V[m x r])^T

A: n x m matrix (e.g., n documents, m terms)

U: n x r matrix (n documents, r concepts)

\Lambda: r x r diagonal matrix (strength of each 'concept') (r: rank of the matrix)

V: m x r matrix (m terms, r concepts)

Page 19: SVD - Properties

THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U \Lambda V^T, where

U, V: unique (*)

U, V: column-orthonormal (i.e., columns are unit vectors, orthogonal to each other)

U^T U = I; V^T V = I (I: identity matrix)

\Lambda: the singular values are positive, and sorted in decreasing order
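A NumPy check of these properties (a sketch, using the 7 x 5 document-term matrix from the example a few slides below):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(U.shape[1]))     # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))  # columns of V are orthonormal
assert np.all(s[:-1] >= s[1:])                      # singular values sorted decreasingly
assert np.allclose(U @ np.diag(s) @ Vt, A)          # A = U Lambda V^T
print(np.round(s, 2))                               # only two are essentially non-zero
```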

Page 20: SVD - Properties

‘spectral decomposition’ of the matrix:

    1 1 1 0 0
    2 2 2 0 0
    1 1 1 0 0
A = 5 5 5 0 0   =   [ u1  u2 ]  x  diag(\lambda_1, \lambda_2)  x  [ v1^T ]
    0 0 0 2 2                                                     [ v2^T ]
    0 0 0 3 3
    0 0 0 1 1

Page 21: SVD - Interpretation

‘documents’, ‘terms’ and ‘concepts’:

U: document-to-concept similarity matrix

V: term-to-concept similarity matrix

\Lambda: its diagonal elements give the 'strength' of each concept

Projection:

best axis to project on: (‘best’ = min sum of squares of projection errors)

Page 22: SVD - Example

A = U \Lambda V^T - example. Rows of A are documents (the first four are CS documents, the last three are MD documents); columns are the terms data, inf., retrieval, brain, lung.

      1 1 1 0 0         0.18 0
      2 2 2 0 0         0.36 0
      1 1 1 0 0         0.18 0
A =   5 5 5 0 0    =    0.90 0      x   9.64 0     x   0.58 0.58 0.58 0    0
      0 0 0 2 2         0    0.53       0    5.29      0    0    0    0.71 0.71
      0 0 0 3 3         0    0.80
      0 0 0 1 1         0    0.27

Page 23: SVD - Example

Same decomposition as above. The two columns of U correspond to the CS-concept and the MD-concept: U is the document-to-concept similarity matrix.

Page 24: SVD - Example

Same decomposition as above. The diagonal entries of \Lambda give the 'strength' of each concept; 9.64 is the strength of the CS-concept.

Page 25: SVD - Example

Same decomposition as above. V (shown as V^T) is the term-to-concept similarity matrix; the first row of V^T is the CS-concept.

Page 26: SVD - Dimensionality reduction

Q: how exactly is dim. reduction done?

A: set the smallest singular values to zero:

Starting from the decomposition above, \Lambda = diag(9.64, 5.29) becomes diag(9.64, 0): only the first column of U and the first row of V^T are kept.
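A sketch of exactly this truncation step (keep the k largest singular values, zero the rest, and rebuild the matrix); the helper name rank_k_approx is my own:

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k approximation of A: zero all but the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_trunc = np.where(np.arange(len(s)) < k, s, 0.0)
    return U @ np.diag(s_trunc) @ Vt

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

print(np.round(rank_k_approx(A, 1), 2))   # compare with the rank-1 result on the next slides
```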

Page 27: SVD - Dimensionality reduction

1 1 1 0 0         0.18
2 2 2 0 0         0.36
1 1 1 0 0         0.18
5 5 5 0 0    ~    0.90    x   9.64   x   [ 0.58 0.58 0.58 0 0 ]
0 0 0 2 2         0
0 0 0 3 3         0
0 0 0 1 1         0

Page 28: SVD - Dimensionality reduction

1 1 1 0 0         1 1 1 0 0
2 2 2 0 0         2 2 2 0 0
1 1 1 0 0         1 1 1 0 0
5 5 5 0 0    ~    5 5 5 0 0
0 0 0 2 2         0 0 0 0 0
0 0 0 3 3         0 0 0 0 0
0 0 0 1 1         0 0 0 0 0

Page 29: Multidimensional Scaling Procedures

Similar in spirit to PCA, but it takes a dissimilarity matrix as input.

Page 30: Multidimensional Scaling Procedures

The purpose of multidimensional scaling (MDS) is to map the distances between points in a high dimensional space into a lower dimensional space without too much loss of information.

Page 31: Math

MDS seeks values z_1, ..., z_N in R^k that minimize the so-called stress function

S_M(z_1, \ldots, z_N) = \sum_{i \neq i'} \left( d_{ii'} - \lVert z_i - z_{i'} \rVert \right)^2

This is known as least-squares (or Kruskal-Shepard) multidimensional scaling. A gradient descent algorithm is used to minimize S_M.

A variation is Sammon's (1969) non-linear mapping, which minimizes the stress function

S_{Sm}(z_1, \ldots, z_N) = \sum_{i \neq i'} \frac{\left( d_{ii'} - \lVert z_i - z_{i'} \rVert \right)^2}{d_{ii'}}
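A hedged NumPy/SciPy sketch (not from the slides) that minimizes the least-squares stress above with a generic gradient-based optimizer; the names stress and mds are my own:

```python
import numpy as np
from scipy.optimize import minimize

def stress(z_flat, D, k):
    """Least-squares stress: sum over pairs of (d_ii' - ||z_i - z_i'||)^2."""
    Z = z_flat.reshape(-1, k)
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)
    iu = np.triu_indices_from(D, k=1)            # count each pair once
    return ((D[iu] - dist[iu]) ** 2).sum()

def mds(D, k=2, seed=0):
    """Map an N x N dissimilarity matrix D to N points in R^k by minimizing stress."""
    rng = np.random.default_rng(seed)
    z0 = rng.normal(scale=D.mean(), size=(D.shape[0], k)).ravel()
    res = minimize(stress, z0, args=(D, k), method="L-BFGS-B")
    return res.x.reshape(-1, k), res.fun
```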

Page 32:

We use MDS to visualize the dissimilarities between objects.

The procedures are very exploratory and their interpretations are as much art as they are science.

Page 33: Examples

The “points” that are represented in multidimensional space can be just about anything.

These objects might be people, in which case MDS can identify clusters of people who are “close” versus “distant” in some real or psychological sense.

Page 34: Multidimensional Scaling Procedures

As long as the “distance” between the objects can be assessed in some fashion, MDS can be used to find the lowest dimensional space that still adequately captures the distances between objects.

Once the number of dimensions is identified, a further challenge is identifying the meaning of those dimensions.

Basic data representation in MDS is a dissimilarity matrix that shows the distance between every possible pair of objects.

The goal of MDS is to faithfully represent these distances in the lowest-dimensional space possible.

Page 35: Multidimensional Scaling Procedures

The mathematics behind MDS can be daunting to understand.

Two types: classical (metric) multidimensional scaling and non-metric scaling.

Example: Distances between cities on the globe

Page 36: Multidimensional Scaling Procedures

This table lists the distances between European cities. A multidimensional scaling of these data should be able to recover the two dimensions (North-South x East-West) that we know must underlie the spatial relations among the cities.

         Athens  Berlin  Dublin  London  Madrid  Paris  Rome  Warsaw
Athens        0    1119    1777    1486    1475   1303   646    1013
Berlin     1119       0     817     577    1159    545   736     327
Dublin     1777     817       0     291     906    489  1182    1135
London     1486     577     291       0     783    213   897     904
Madrid     1475    1159     906     783       0    652   856    1483
Paris      1303     545     489     213     652      0   694     859
Rome        646     736    1182     897     856    694     0     839
Warsaw     1013     327    1135     904    1483    859   839       0
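As a usage example (a sketch that reuses the hypothetical mds() function from the "Math" slide above; dividing the distances by 1000 is only to keep the numbers small for the optimizer):

```python
import numpy as np

cities = ["Athens", "Berlin", "Dublin", "London", "Madrid", "Paris", "Rome", "Warsaw"]
D = np.array([
    [   0, 1119, 1777, 1486, 1475, 1303,  646, 1013],
    [1119,    0,  817,  577, 1159,  545,  736,  327],
    [1777,  817,    0,  291,  906,  489, 1182, 1135],
    [1486,  577,  291,    0,  783,  213,  897,  904],
    [1475, 1159,  906,  783,    0,  652,  856, 1483],
    [1303,  545,  489,  213,  652,    0,  694,  859],
    [ 646,  736, 1182,  897,  856,  694,    0,  839],
    [1013,  327, 1135,  904, 1483,  859,  839,    0],
], dtype=float)

Z, final_stress = mds(D / 1000.0, k=2)        # mds() is defined in the sketch above
for name, (x, y) in zip(cities, Z):
    print(f"{name:8s} {x:7.2f} {y:7.2f}")
```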

Page 37: Multidimensional Scaling Procedures

MDS begins by restricting the dimension of the space and then seeking an arrangement of the objects in that restricted space that minimizes the difference between the distances in that space and the actual distances.

Page 38: Multidimensional Scaling Procedures

The appropriate number of dimensions is identified...

The objects can be plotted in the multidimensional space...

Determine which objects cluster together and why they might cluster together. The latter issue concerns the meaning of the dimensions and often requires additional information.

Page 39: Multidimensional Scaling Procedures

In the cities data, the meaning is quite clear.

The dimensions refer to the North-South x East-West surface area across which the cities are dispersed.

We would expect MDS to faithfully recreate the map relations among the cities.

Page 40: Multidimensional Scaling Procedures

This arrangement provides the best fit for a one-dimensional model. How good is the fit? We use a statistic called “stress” to judge the goodness-of-fit.

[Figure: Derived stimulus configuration (Euclidean distance model): a one-dimensional plot of the eight cities along dimension 1.]

Page 41: Multidimensional Scaling Procedures

Smaller stress values indicate better fit. Some rules of thumb for degree of fit are:

Stress   Fit
 .20     Poor
 .10     Fair
 .05     Good
 .02     Excellent
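The slides do not spell out how the stress statistic is normalized; assuming it is Kruskal's stress-1 (the form these rules of thumb are usually quoted for), a sketch of its computation is:

```python
import numpy as np

def stress_1(D, Z):
    """Kruskal's stress-1 for an embedding Z (N x k) of a dissimilarity matrix D (N x N)."""
    diff = Z[:, None, :] - Z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices_from(D, k=1)
    return np.sqrt(((D[iu] - dist[iu]) ** 2).sum() / (dist[iu] ** 2).sum())
```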


Page 42: Multidimensional Scaling Procedures

The stress for the one-dimensional model of the cities data is .31, clearly a poor fit.

The poor fit can also be seen in a plot of the actual distances versus the distances in the one-dimensional model, known as a Shepard plot.

[Figure: Shepard plot (scatterplot of linear fit, Euclidean distance model): disparities vs. distances for the one-dimensional model.]

In a good fitting model, the points will lie along a line, sloping upward to the right, showing a one-to-one correspondence between distances in the model space and actual distances. Clearly not evident here.


Page 43: Multidimensional Scaling Procedures

[Figure: Shepard plot (disparities vs. distances) for the two-dimensional model.]

A two-dimensional model fits very well. The stress value is also quite small (.00902) indicating an exceptional fit. Of course, this is no great surprise for these data.


Page 44: Multidimensional Scaling Procedures

[Figure: Shepard plot (disparities vs. distances) for the three-dimensional model.]

There is hardly any room for a three-dimensional model to improve matters. The stress is .00918, indicating that a third dimension does not help at all.


Page 45: MDS Example: Clusters among Prostate Samples