http://creativecommons.org/licenses/by-sa/2.0/. Principal Component Analysis & Clustering...


Upload: gerald-dawson

Post on 03-Jan-2016


Page 1: http://creativecommons.org/licenses/by-sa/2.0/. Principal Component Analysis & Clustering. Prof: Rui Alves, ralves@cmb.udl.es, 973702406, Dept Ciencies Mediques

http://creativecommons.org/licenses/by-sa/2.0/

Principal Component Analysis & Clustering

Prof: Rui Alves

ralves@cmb.udl.es

973702406

Dept Ciencies Mediques Basiques, 1st Floor, Room 1.08

Website of the Course: http://web.udl.es/usuaris/pg193845/Courses/Bioinformatics_2007/

Course: http://10.100.14.36/Student_Server/

Complex Datasets

•When studying complex biological samples, sometimes there are too many variables

•For example, when studying Medaka development using Phospho metabolomics, you may have measurements of different amino acids, etc.

•Question: Can we find markers of development using these metabolites?

•Question: How do we analyze the data?

Problems

• How do you visually represent the data?

– The sample has many dimensions, so plots are not a good solution

• How do you make sense of it or extract information from it?

– With so many variables, how do you know which ones are important for identifying signatures?

Two possible ways (out of many) to address the problems

• PCA

• Clustering

Solution 1: Try a data reduction method

• If we can combine the different columns in specific ways, then maybe we can find a way to reduce the number of variables that we need to represent and analyze:

– Principal Component Analysis

Variation in data is what identifies signatures

              Metabolite 1   Metabolite 2   Metabolite 3   …
Condition C1  0.01           3              0.1
Condition C2  0.02           0.01           5
Condition C3  0.015          0.8            1.3

Variation in data is what identifies signatures

Virtual Metabolite: Metabolite 2 + 1/Metabolite 3

The signal is much stronger and separates conditions 1, 2, and 3

[Figure: conditions placed along the V. Metab. axis (0 to 20) in the order C2, C3, C1]
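
The arithmetic behind the virtual metabolite can be checked directly from the three conditions in the table. The sketch below assumes the combination named on the slide (Metabolite 2 + 1/Metabolite 3); the variable and function names are illustrative only.

```python
# Measurements from the table: (Metabolite 1, Metabolite 2, Metabolite 3)
conditions = {
    "C1": (0.01, 3.0, 0.1),
    "C2": (0.02, 0.01, 5.0),
    "C3": (0.015, 0.8, 1.3),
}


def virtual_metabolite(m1, m2, m3):
    """Combination from the slide: Metabolite 2 + 1/Metabolite 3 (m1 is unused)."""
    return m2 + 1.0 / m3


virtual = {name: virtual_metabolite(*values) for name, values in conditions.items()}

# Print the conditions ordered along the virtual-metabolite axis
for name in sorted(virtual, key=virtual.get):
    print(name, round(virtual[name], 2))
```

With these numbers the virtual metabolite is about 0.21 for C2, 1.57 for C3 and 13 for C1, so a single combined variable spreads the three conditions far apart, whereas Metabolite 1 alone barely distinguishes them.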

Principal component analysis

• From k “old” variables, define k “new” variables that are linear combinations of the old variables:

$y_1 = a_{11} x_1 + a_{12} x_2 + \dots + a_{1k} x_k$

$y_2 = a_{21} x_1 + a_{22} x_2 + \dots + a_{2k} x_k$

...

$y_k = a_{k1} x_1 + a_{k2} x_2 + \dots + a_{kk} x_k$

(the y's are the new variables, the x's are the old variables)

Defining the New Variables Y

• yk's are uncorrelated (orthogonal)

• y1 explains as much as possible of original variance in data set

• y2 explains as much as possible of remaining variance

• etc.
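
These properties can be checked numerically. The sketch below is a minimal illustration on made-up data (35 samples, 3 correlated variables, all hypothetical): it builds the new variables from the eigenvectors of the covariance matrix, then confirms that they are uncorrelated and ordered by decreasing variance. The names `A`, `Y` and `X` are illustrative, not taken from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 35 samples of 3 variables that share one hidden factor
latent = rng.normal(size=(35, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(35, 1)) for _ in range(3)])
X = X * np.array([1.0, 2.0, 0.5])  # give the variables different spreads

# Centre the data and take the eigenvectors of its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder: largest variance first

A = eigvecs[:, order].T   # row i holds the coefficients a_i1 ... a_ik of y_i
Y = Xc @ A.T              # new variables: each column is one y_i

cov_Y = np.cov(Y, rowvar=False)
variances = np.diag(cov_Y)
off_diagonal = cov_Y - np.diag(variances)

print("variances of y1, y2, y3:", variances)
print("largest off-diagonal covariance:", np.abs(off_diagonal).max())
```

The off-diagonal covariances of the new variables vanish up to rounding error, and the variances come out in decreasing order, which is what "y1 explains as much as possible" means in practice.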

Principal Components Analysis on:

• Covariance Matrix:

– Variables must be in same units

– Emphasizes variables with most variance

– Mean eigenvalue ≠ 1.0

• Correlation Matrix:

– Variables are standardized (mean 0.0, SD 1.0)

– Variables can be in different units

– All variables have same impact on analysis

– Mean eigenvalue = 1.0
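
The difference between the two choices shows up clearly when the variables have very different scales. The following sketch uses made-up, independent variables (a hypothetical 35 x 4 matrix) and compares the eigenvalues of the covariance and correlation matrices. The mean eigenvalue of a correlation matrix is 1.0 because its diagonal is all ones, so its trace equals the number of variables.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 35 samples, 4 variables measured on very different scales
X = rng.normal(size=(35, 4)) * np.array([0.01, 1.0, 10.0, 100.0])

cov_eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
cor_eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

print("mean eigenvalue, covariance matrix: ", cov_eigvals.mean())
print("mean eigenvalue, correlation matrix:", cor_eigvals.mean())

# Share of the total variance captured by the largest component in each case
print("largest share (covariance): ", cov_eigvals.max() / cov_eigvals.sum())
print("largest share (correlation):", cor_eigvals.max() / cor_eigvals.sum())
```

On the covariance matrix the variable with the largest scale dominates the first component almost entirely, while on the correlation matrix every variable enters with the same weight.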

Covariance Matrix

• Covariance is the measure of how much two random variables vary together

$Cov(X_1, X_2) = \frac{\sum_{i=1}^{n} (x_{1,i} - \hat{x}_1)(x_{2,i} - \hat{x}_2)}{n}$

      X1     X2     X3     …
X1    σ1²    0.03   0.05   …
X2    …      σ2²    3      …
X3    …      …      σ3²    …
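
A direct implementation of the covariance definition makes the formula concrete. The sketch below assumes the sum of products of deviations is divided by n, as the slide's formula appears to do (many libraries divide by n - 1 by default), and uses made-up measurements. It compares the result with `numpy.cov` called with `bias=True`, which also normalizes by n.

```python
import numpy as np


def covariance(x1, x2):
    """Sum of products of deviations from the means, divided by n."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return float(np.sum((x1 - x1.mean()) * (x2 - x2.mean())) / len(x1))


# Hypothetical measurements of two variables in four samples
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, 6.0, 9.0]

print(covariance(x1, x2))               # hand-written formula
print(np.cov(x1, x2, bias=True)[0, 1])  # same quantity from NumPy
```

For this example the deviations multiply to 4.875, 0.625, 0.375 and 5.625, which sum to 11.5, so the covariance is 11.5 / 4 = 2.875. The diagonal entries of the covariance matrix are the same formula applied to a variable with itself, that is, its variance.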

Covariance Matrix

      X1     X2     X3     …
X1    σ1²    0.03   0.05   …
X2    …      σ2²    3      …
X3    …      …      σ3²    …

Diagonalize matrix:

$\mathrm{CovM} = D^{T}\,\mathrm{CovM}_{Diag}\,D$

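Eigenvectors are the principal components; eigenvalues give their variances

$D = \mathrm{EigenVectors} = \begin{pmatrix} a_{11} & a_{12} & \dots \\ \dots & \dots & \dots \\ a_{k1} & \dots & a_{kk} \end{pmatrix}$

$\mathrm{CovM}_{Diag} = \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \dots \end{pmatrix}$

Tells us how much each PC contributes to a data point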
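
The diagonalization can be reproduced with a standard eigendecomposition. The sketch below uses a made-up 3 x 3 covariance matrix and assumes that the rows of D are the eigenvectors (so that row i holds the coefficients of the i-th new variable). Under that assumption the covariance matrix equals D transposed, times the diagonal matrix of eigenvalues, times D.

```python
import numpy as np

# Hypothetical symmetric covariance matrix for three variables
cov_m = np.array([
    [2.0, 0.3, 0.5],
    [0.3, 1.0, 0.2],
    [0.5, 0.2, 0.5],
])

eigvals, eigvecs = np.linalg.eigh(cov_m)  # ascending eigenvalues, eigenvectors in columns
order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
eigvals = eigvals[order]

D = eigvecs[:, order].T      # rows of D are the eigenvectors
cov_diag = np.diag(eigvals)  # diagonalized covariance matrix

reconstructed = D.T @ cov_diag @ D   # should reproduce cov_m
explained = eigvals / eigvals.sum()  # share of total variance per component

print(np.round(reconstructed, 6))
print("share of variance per PC:", np.round(explained, 3))
```

The ratio of each eigenvalue to the sum of all eigenvalues is the fraction of the total variance carried by the corresponding component, which is the usual basis for deciding how many components to keep.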

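Principal components are eigenvectors; eigenvalues give their variances

[Figure: scatter plot of the data with the 1st Principal Component, y1, and the 2nd Principal Component, y2, drawn as new axes; λ1 and λ2 label the two components]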

Now we have reduced the problem to two variables

[Figure: scatter plot of the samples on PC 1 (horizontal axis) versus PC 2 (vertical axis); legend: Day 1, Day 2, Day 3, Day 4, Day 5, Day 6, Day 7, Day 8]

What if things are still a mess?

•Days 3, 4, 5 and 6 do not separate very well

•What could we do to try and improve this?

•Maybe add an extra PC axis to the plot!
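
One common way to decide whether an extra PC axis is worth adding is to look at the cumulative share of variance explained. The sketch below is an illustration on made-up data (35 samples, 20 variables driven by 3 hidden factors). The 90% target and the function name `n_components_for` are assumptions chosen for the example, not values from the course.

```python
import numpy as np


def n_components_for(X, target=0.9):
    """Smallest number of PCs whose cumulative variance share reaches `target`."""
    X = np.asarray(X, dtype=float)
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # index of the first cumulative share >= target, converted to a count
    return int(np.searchsorted(cumulative, target) + 1), cumulative


rng = np.random.default_rng(2)

# Hypothetical data: 35 samples, 20 variables driven by 3 hidden factors plus noise
factors = rng.normal(size=(35, 3))
loadings = rng.normal(size=(3, 20))
X = factors @ loadings + 0.05 * rng.normal(size=(35, 20))

k, cumulative = n_components_for(X, target=0.9)
print("components needed:", k)
print("cumulative share of the first four PCs:", np.round(cumulative[:4], 3))
```

If two components leave a large share of the variance unexplained, a third axis is likely to help; if the cumulative share is already close to 1, poor separation has to be addressed some other way.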

Days separate well with three variables

[Figure: 3D scatter plot of the samples on the axes PC 1, PC 2 and PC 3; legend: Day 1, Day 2, Day 3, Day 4, Day 5, Day 6, Day 7, Day 8]

Two possible ways to address the problems

• PCA

• Clustering

Complex Datasets

Solution 2: Try using all data and representing it in a low dimensional figure

• If we can cluster the different days according to some distance function between all amino acids, we can represent the data in an intuitive way.

What is data clustering?

• Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait

• The number of clusters is usually defined in advance

Types of data clustering

• Hierarchical

– Find successive clusters using previously established clusters

  • Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters

  • Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters

• Partitional

– Find all clusters at once

First things first: distance is important

• Selecting a distance measure will determine how data is agglomerated

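Euclidean Distance: $\sqrt{\sum_{i=1}^{n} \Delta x_i^{2}}$

Manhattan Distance: $\sum_{i=1}^{n} |\Delta x_i|$

Mahalanobis Distance: $\sqrt{\Delta x^{T}\, Cor^{-1}\, \Delta x}$

Tchebyshev Distance: $\max_{1 \le i \le n} |\Delta x_i|$

(here $\Delta x_i$ denotes the difference between two data points along dimension i, and n is the number of dimensions)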
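
The four distance measures are short enough to write out in full. The sketch below is a plain NumPy illustration with hypothetical function names. For the Mahalanobis distance the caller passes in the matrix to invert: the usual definition uses the covariance matrix of the data, whereas the slide's formula names it Cor.

```python
import numpy as np


def _diff(x, y):
    """Coordinate-wise difference between two points."""
    return np.asarray(x, dtype=float) - np.asarray(y, dtype=float)


def euclidean(x, y):
    return float(np.sqrt(np.sum(_diff(x, y) ** 2)))


def manhattan(x, y):
    return float(np.sum(np.abs(_diff(x, y))))


def tchebyshev(x, y):
    return float(np.max(np.abs(_diff(x, y))))


def mahalanobis(x, y, matrix):
    """Distance scaled by the inverse of `matrix` (typically the data covariance)."""
    d = _diff(x, y)
    return float(np.sqrt(d @ np.linalg.inv(matrix) @ d))


p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q), manhattan(p, q), tchebyshev(p, q))
print(mahalanobis(p, q, np.eye(2)))             # identity matrix: same as Euclidean
print(mahalanobis(p, q, np.diag([9.0, 16.0])))  # each axis rescaled by its variance
```

For the points (0, 0) and (3, 4) the Euclidean distance is 5, the Manhattan distance is 7 and the Tchebyshev distance is 4, so the same pair of samples can look closer or farther apart depending on the measure chosen.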

Reducing the data and finding amino acid signatures in development

•Decide on number of clusters: Three clusters

•Do your PCA of the dataset (20 variables, 35 data points)

•Use Euclidean Distance

•Use a Hierarchical, Divisive Algorithm

Hierarchical, Divisive Clustering: Step 1 – One Cluster

•Consider all data points as members of a single cluster

Hierarchical, Divisive Clustering: Step 1.1 – Building the Second Cluster

centroid

Furthest point from centroid

New seed cluster

Hierarchical, Divisive Clustering: Step 1.1 – Building the Second Cluster

Recalculate centroid

Add new point further from old centroid, closer to new

Rinse and repeat until…

Hierarchical, Divisive Clustering: Step 1.2 – Finishing a Cluster

Recalculate centroids: if both centroids become closer, do not add point & stop adding to cluster

Add new point further from old centroid, closer to new
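
Steps 1.1 and 1.2 can be sketched as a single function. This is one possible reading of the slides, not necessarily the exact algorithm used in the course. It seeds the new cluster with the point furthest from the centroid, then repeatedly moves over the point that is most clearly closer to the new centroid than to the old one, and stops as soon as a move would bring the two centroids closer together. The function name `split_cluster` and the tie-breaking rule are assumptions.

```python
import numpy as np


def split_cluster(points):
    """Split one cluster in two; returns a boolean mask marking the new cluster."""
    points = np.asarray(points, dtype=float)
    in_new = np.zeros(len(points), dtype=bool)

    # Seed the new cluster with the point furthest from the overall centroid
    centroid = points.mean(axis=0)
    in_new[int(np.argmax(np.linalg.norm(points - centroid, axis=1)))] = True

    # Keep at least one point in the old cluster
    while (~in_new).sum() > 1:
        c_old = points[~in_new].mean(axis=0)
        c_new = points[in_new].mean(axis=0)
        gap = np.linalg.norm(c_old - c_new)

        # Candidate: point of the old cluster that is farther from the old
        # centroid and closer to the new one (largest difference wins)
        d_old = np.linalg.norm(points - c_old, axis=1)
        d_new = np.linalg.norm(points - c_new, axis=1)
        score = np.where(in_new, -np.inf, d_old - d_new)
        candidate = int(np.argmax(score))
        if score[candidate] <= 0:
            break  # no remaining point is closer to the new centroid

        # Tentatively move the point and recalculate both centroids
        in_new[candidate] = True
        new_gap = np.linalg.norm(points[~in_new].mean(axis=0) - points[in_new].mean(axis=0))
        if new_gap < gap:
            in_new[candidate] = False  # centroids moved closer: undo and stop
            break
    return in_new


# Hypothetical example: two well-separated groups of four points each
data = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
print(split_cluster(data))
```

On the example data the first four points end up in the new cluster and the last four stay in the old one. Applying the same function again to one of the resulting clusters would give the next level of the hierarchy.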

Hierarchical, Divisive Clustering: Step 2 – Two Clusters

•Use an optimization algorithm to divide the data points in such a way that the Euclidean distance between all points within each of the two clusters is minimal

Hierarchical, Divisive Clustering: Step 3 – Three Clusters

•Continue dividing datapoints until all clusters have been defined

Reducing the data and finding amino acid signatures in development

•Decide on number of clusters: Three clusters

•Use Euclidean Distance

•Use a Hierarchical, Agglomerative Algorithm

Hierarchical, Agglomerative Clustering: Step 1 – 35 Clusters

•Consider each data point as a cluster

Hierarchical, Agglomerative Clustering: Step 2 – Decreasing the number of clusters

•Search for the two data points that are closest to each other

•Collapse them into a cluster

•Repeat until you have only three clusters
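
The agglomerative procedure above can be sketched in a few lines. The slides do not say how the distance to an already merged cluster is measured, so this sketch assumes each cluster is represented by its centroid. Other linkage rules (single, complete, average) are equally valid and can give different results. The function name `agglomerate` is illustrative.

```python
import numpy as np


def agglomerate(points, n_clusters=3):
    """Merge the two closest clusters until only `n_clusters` remain."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]  # step 1: every point is a cluster

    while len(clusters) > n_clusters:
        # Represent each cluster by its centroid (an assumption of this sketch)
        centroids = np.array([points[members].mean(axis=0) for members in clusters])
        dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)  # ignore the distance of a cluster to itself

        # Find the closest pair and collapse it into one cluster
        a, b = np.unravel_index(np.argmin(dist), dist.shape)
        a, b = sorted((int(a), int(b)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical example: three tight pairs of points
data = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
print(agglomerate(data, n_clusters=3))
```

On the example data the three pairs are recovered as the three clusters. Recording the order of the merges instead of stopping at three clusters would give the full tree (dendrogram) of the hierarchy.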

Reducing the data and finding amino acid signatures in development

•Decide on number of clusters: Three clusters

•Use Euclidean Distance

•Use a Partitional Algorithm

Partitional Clustering

•Search for the three data points that are farthest from each other

•Add points to each of these, according to shortest distance

•Repeat until all points have been assigned to a cluster
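
A minimal sketch of this partitional scheme follows. "Farthest from each other" is interpreted here as the triple of points with the largest sum of pairwise Euclidean distances, found by brute force (feasible for a few dozen points), and every point is then assigned to its nearest seed. Both choices are assumptions made for illustration. The function name `partition` is hypothetical.

```python
import itertools

import numpy as np


def partition(points, n_clusters=3):
    """Pick mutually distant seeds, then assign every point to its nearest seed."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    # Seeds: the combination of points with the largest total pairwise distance
    seeds = max(
        itertools.combinations(range(len(points)), n_clusters),
        key=lambda idx: sum(dist[i, j] for i, j in itertools.combinations(idx, 2)),
    )

    # Label of each point = position of its closest seed
    labels = np.argmin(dist[:, list(seeds)], axis=1)
    return list(seeds), labels


# Hypothetical example: three tight pairs of points
data = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
seeds, labels = partition(data, n_clusters=3)
print(seeds, labels)
```

Unlike the hierarchical approaches, all three clusters are formed in one pass. Methods such as k-means refine this idea by recomputing the cluster centres after each assignment round.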

Clustering the days of development with amino acid signatures

•Get your data matrix

•Use Euclidean Distance

•Use a Clustering Algorithm

Final Notes on Clustering

•If more than three PCs were needed to separate the data, we could have used the principal components matrix and clustered from there

•Clustering can be fuzzy

•Using algorithms such as genetic algorithms, neural networks or Bayesian networks, one can extract clusters that are completely non-obvious

Summary

• PCA allows for data reduction and decreases dimensions of the datasets to be analyzed

• Clustering allows for classification (independent of PCA) and allows for good visual representations