Lecture 2: Data Types in Computational Biology/Systems Biology, Useful Websites


Upload: verda

Post on 21-Mar-2016


Page 1: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Lecture 2
Data Types in computational biology/Systems biology
Useful websites
Handling Multivariate data: Concept and types of metrics, distances etc.
Introduction to PCA and PLS
K-means clustering

Page 2: Lecture  2 Data Types in computational biology/Systems biology Useful websites

What is systems biology?

Each lab/group has its own definition of systems biology.

This is because systems biology requires understanding and integrating different levels of omics information using knowledge from different branches of science, and individual labs/groups work in different areas.

Theoretical target: Understanding life as a system.

Practical Targets: Serving humanity by developing new generation medical tests, drugs, foods, fuel, materials, sensors, logic gates……

Understanding life, or even a single cell, as a system is complicated and requires comprehensive analysis of different data types and/or sub-systems. Mostly, individual groups or people work on different sub-systems.

Page 3: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Some of the currently partially available and useful data types:

• Genome sequences
• Binding motifs in DNA sequences or CIS regulatory region
• CODON usage
• Gene expression levels for global gene sets/microRNAs
• Protein sequences
• Protein structures
• Protein domains
• Protein-protein interactions
• Binding relation between proteins and DNA
• Regulatory relation between genes
• Metabolic Pathways
• Metabolite profiles
• Species-metabolite relations
• Plants usage in traditional medicines

Usually, experiments are conducted in wet labs to generate such data. In dry labs like ours, we analyze these data to extract targeted information using different algorithms, statistics, etc.

Data Types in computational biology/Systems biology

Page 4: Lecture  2 Data Types in computational biology/Systems biology Useful websites

>gi|15223276|ref|NP_171609.1| ANAC001 (Arabidopsis NAC domain containing protein 1); transcription factor [Arabidopsis thaliana]
MEDQVGFGFRPNDEELVGHYLRNKIEGNTSRDVEVAISEVNICSYDPWNLRFQSKYKSRDAMWYFFSRRENNKGNRQSRTTVSGKWKLTGESVEVKDQWGFCSEGFRGKIGHKRVLVFLDGRYPDKTKSDWVIHEFHYDLLPEHQRTYVICRLEYKGDDADILSAYAIDPTPAFVPNMTSSAGSVVNQSRQRNSGSYNTYSEYDSANHGQQFNENSNIMQQQPLQGSFNPLLEYDFANHGGQWLSDYIDLQQQVPYLAPYENESEMIWKHVIEENFEFLVDERTSMQQHYSDHRPKKPVSGVLPDDSSDTETGSMIFEDTSSSTDSVGSSDEPGHTRIDDIPSLNIIEPLHNYKAQEQPKQQSKEKVISSQKSECEWKMAEDSIKIPPSTNTVKQSWIVLENAQWNYLKNMIIGVLLFISVISWIILVG

Sequence data (Genome /Protein sequence)

Usually, BLAST (a heuristic search tool) or dynamic-programming algorithms such as Needleman-Wunsch and Smith-Waterman are used to determine how well two or more sequences match each other.

Sequence matching/alignments
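As a small illustration of the dynamic-programming idea behind sequence alignment, the sketch below computes a global alignment score in the style of Needleman-Wunsch. The function name and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions made for this example; real tools use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


identical = needleman_wunsch_score("MEDQ", "MEDQ")   # four matches
one_gap = needleman_wunsch_score("MEDQ", "MEQ")      # three matches, one gap
```

The table is filled row by row, so the running time grows with the product of the two sequence lengths.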

Page 5: Lecture  2 Data Types in computational biology/Systems biology Useful websites

CODONS

Page 7: Lecture  2 Data Types in computational biology/Systems biology Useful websites

CODON USAGE

Page 8: Lecture  2 Data Types in computational biology/Systems biology Useful websites

CODON USAGE

Page 9: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Multivariate data (Gene expression data/Metabolite profiles)

There are many types of clustering algorithms applicable to multivariate data, e.g. hierarchical clustering, K-means, and self-organizing maps (SOM).

Multivariate data can also be modeled using a multivariate probability distribution function.

Page 10: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Binary relational Data (Protein-protein interactions, Regulatory relation between genes, Metabolic Pathways) are networks.

Clustering is usually used to extract information from networks.

Multivariate data and sequence data can also be easily converted to networks, and then network clustering can be applied.

AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH
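The six interaction pairs listed above (AtpB-AtpA, AtpG-AtpE, AtpA-AtpH, AtpB-AtpH, AtpG-AtpH, AtpE-AtpH) can be loaded as an undirected network. The sketch below builds an adjacency structure and counts each protein's neighbours; the variable and function names are assumptions made for this example.

```python
from collections import defaultdict

# Edge list taken from the interaction pairs on this slide
edges = [
    ("AtpB", "AtpA"),
    ("AtpG", "AtpE"),
    ("AtpA", "AtpH"),
    ("AtpB", "AtpH"),
    ("AtpG", "AtpH"),
    ("AtpE", "AtpH"),
]


def build_adjacency(edge_list):
    """Map each node to the set of nodes it interacts with (undirected)."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


adjacency = build_adjacency(edges)
degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
hub = max(degree, key=degree.get)   # node with the most interaction partners
```

An adjacency structure like this is the usual starting point for network clustering methods.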

Page 11: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Useful Websites

Page 12: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Some websites where we can find different types of data and links to other databases:

www.geneontology.org
www.genome.ad.jp/kegg
www.ncbi.nlm.nih.gov
www.ebi.ac.uk/databases
http://www.ebi.ac.uk/uniprot/
http://www.yeastgenome.org/
http://mips.helmholtz-muenchen.de/proj/ppi/
http://www.ebi.ac.uk/trembl
http://dip.doe-mbi.ucla.edu/dip/Main.cgi
www.ensembl.org

Pages 13-21: Lecture  2 Data Types in computational biology/Systems biology Useful websites

[Slides reproduced from the source below; pages 18 and 19 are titled NETWORK TOOLS]

Source: Knowledge-Based Bioinformatics: From Analysis to Interpretation, Gil Alterovitz, Marco Ramoni (Editors)

Page 22: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Handling Multivariate data: Concept and types of metrics

Multivariate data format

Multivariate data example

Page 23: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Distances, metrics, dissimilarities and similarities are related concepts

A metric is a function that satisfies the following properties:

A function that satisfies only conditions (i)-(iii) is referred to as a distance.
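For reference, a function $d$ on pairs of observations is usually called a metric when it has the five properties below. The labels (i)-(v) and their order are an assumption, chosen so that (i)-(iii) are the weaker conditions mentioned above; the cited source may phrase them differently.

```latex
\begin{align*}
\text{(i)}\quad   & d(a,b) \ge 0                      && \text{non-negativity} \\
\text{(ii)}\quad  & d(a,b) = d(b,a)                   && \text{symmetry} \\
\text{(iii)}\quad & d(a,a) = 0                        && \text{identification mark} \\
\text{(iv)}\quad  & d(a,b) = 0 \iff a = b             && \text{definiteness} \\
\text{(v)}\quad   & d(a,b) + d(b,c) \ge d(a,c)        && \text{triangle inequality}
\end{align*}
```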

Source: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit (Editors)

Page 24: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Example: Let
X = (4, 6, 8)
Y = (5, 3, 9)

These measures consider the expression measurements as points in some metric space.
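For the example vectors X = (4, 6, 8) and Y = (5, 3, 9), a few common point-to-point distances can be computed directly. The choice of Euclidean, Manhattan and maximum distance here is an assumption for illustration, as are the function names.

```python
import math

X = (4, 6, 8)
Y = (5, 3, 9)


def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def maximum(u, v):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


d_euclidean = euclidean(X, Y)   # sqrt(1 + 9 + 1) = sqrt(11), about 3.317
d_manhattan = manhattan(X, Y)   # 1 + 3 + 1 = 5
d_maximum = maximum(X, Y)       # 3
```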

Page 25: Lecture  2 Data Types in computational biology/Systems biology Useful websites

A widely used function for finding similarity is correlation.

Correlation gives a measure of the linear association between variables and ranges from -1 to +1.
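As a minimal sketch, the Pearson correlation coefficient can be computed from the centered values of two variables. The vectors reuse the earlier example X = (4, 6, 8) and Y = (5, 3, 9); the function name is an assumption made here.

```python
import math


def pearson(u, v):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    numerator = sum(a * b for a, b in zip(du, dv))
    denominator = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return numerator / denominator


X = (4, 6, 8)
Y = (5, 3, 9)
r_xy = pearson(X, Y)   # sqrt(3/7), about 0.655
r_xx = pearson(X, X)   # a variable is perfectly correlated with itself
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little linear association, although a nonlinear relationship may still exist.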

Page 26: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Statistical distance between points

The Euclidean distance between points Q and P is larger than that between Q and the origin O, yet P and Q appear to belong to the same cluster while Q and O do not.

The statistical distance (Mahalanobis distance) between two vectors can be calculated if the variance-covariance matrix is known or estimated.
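The sketch below shows how the Mahalanobis distance takes the covariance structure into account in two dimensions. The covariance matrix and the two test points are hypothetical values chosen for illustration, not the points P, Q and O from the slide.

```python
import math


def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance between 2-D points u and v for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = u[0] - v[0], u[1] - v[1]
    quadratic_form = (dx * (inv[0][0] * dx + inv[0][1] * dy)
                      + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quadratic_form)


# Hypothetical covariance: the two variables are strongly positively correlated
cov = [[1.0, 0.9], [0.9, 1.0]]
origin = (0.0, 0.0)
along = (1.0, 1.0)     # lies along the direction of correlation
across = (1.0, -1.0)   # lies against the direction of correlation

d_along = mahalanobis_2d(along, origin, cov)
d_across = mahalanobis_2d(across, origin, cov)
```

Both test points are at the same Euclidean distance from the origin, but the point lying against the correlation direction is much farther away in the statistical sense.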

Page 27: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Distances between distributions

Unlike the previous approach (i.e. considering expression measurements as points in some metric space), the data for each feature can be considered as an independent sample from a population.

Therefore the data reflects the underlying population and we need to measure similarities between two densities/distributions.

Kullback-Leibler Information

Mutual information

KLI measures how much the shape of one distribution resembles the other

MI is large when the joint distribution is quite different from the product of the marginals.
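For discrete distributions, both quantities can be sketched in a few lines. The definitions used here are the standard ones (natural logarithm); the function names and the toy distributions are assumptions made for illustration, and `kl_divergence` assumes q is positive wherever p is positive.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler information: sum over x of p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """Mutual information of a joint probability table (rows = x, columns = y)."""
    px = [sum(row) for row in joint]            # marginal of x
    py = [sum(col) for col in zip(*joint)]      # marginal of y
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


same_shape = kl_divergence([0.2, 0.8], [0.2, 0.8])            # identical distributions
independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])  # joint = product of marginals
dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])        # y is determined by x
```

Mutual information is the Kullback-Leibler information between the joint distribution and the product of its marginals, which is why it is zero for independent variables.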

Page 28: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Principal Component Analysis (PCA) and Partial Least Squares (PLS)

• Two major common effects of using PCA or PLS
  - Convert a group of correlated predictive variables to a group of uncorrelated variables
  - Construct a “strong” predictive variable from several “weaker” predictive variables

• Major difference between PCA and PLS
  - PCA is performed without consideration of the target variable, so PCA is an unsupervised analysis
  - PLS is performed to maximize the covariance between the target variable and the predictive variables, so PLS is a supervised analysis

Page 29: Lecture  2 Data Types in computational biology/Systems biology Useful websites

[Diagram: PCA and PLS]

PCA: A (n x p), PC (n x p)
PLS: X (n x p), Y (n x q), T (n x c), U (n x c), max cov.

Steps: 1 = Decomposition step, 2 = Regression step

PCA notation: A = data matrix, PC = principal component matrix, n = # of observations, p = # of variables

PLS notation: X = matrix of predictors, Y = matrix of responses, T = factors of predictors, U = factors of responses, n = # of observations, p = # of predictors, q = # of responses, c = # of extracted factors

Page 30: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Principal Component Analysis (PCA)

In Principal Component Analysis, we look for a few linear combinations of the predictive variables which can be used to summarize the data without losing too much information.

Intuitively, principal component analysis is a method of extracting information from higher-dimensional data by projecting it to a lower dimension.

Example: Consider the scatter plot of 3-dimensional data (3 variables). Data across the 3 variables are highly correlated and the majority of the points cluster around the center of the space. This is also the direction of the 1st PC, which gives roughly equal weight to the 3 variables:

PC1 = -0.56 X1 - 0.57 X2 - 0.59 X3

Page 31: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Properties of Principal Components

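• $\mathrm{Var}(PC_i) = \lambda_i$, the $i$-th eigenvalue

• $\mathrm{Cov}(PC_i, PC_j) = 0$ for $i \neq j$

• $\mathrm{Var}(PC_1) \geq \mathrm{Var}(PC_2) \geq \dots \geq \mathrm{Var}(PC_p)$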

Page 32: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Numerical Example

The following are the high school grades of 10 students in 6 subjects (scale 1-10):
• Math = Mathematics
• Chem = Chemistry
• Phy = Physics
• Bio = Biology
• Eco = Economy
• Soc = Sociology

Student  Math  Chem  Phy  Bio  Eco  Soc
A        7     8     7    8    7    7
B        8     7     7    6    8    7
C        9     7     8    7    6    7
D        7     7     7    7    9    8
E        7     6     6    6    8    8
F        7     7     7    7    8    8
G        6     6     6    7    7    7
H        9     8     8    6    6    6
I        8     8     8    7    6    6
J        7     7     6    6    8    9

Page 33: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Results

            PC1     PC2     PC3     PC4     PC5     PC6
Eigenvalue  3.020   0.708   0.497   0.219   0.167   0.023
Proportion  0.652   0.153   0.107   0.047   0.036   0.005
Cumulative  0.652   0.804   0.912   0.959   0.995   1

Eigenvectors

            PC1     PC2     PC3     PC4     PC5     PC6
Math        0.461   0.621  -0.088   0.168   0.267  -0.542
Chem        0.302  -0.059  -0.594   0.016  -0.740  -0.074
Phy         0.428   0.110  -0.365  -0.064   0.386   0.720
Bio         0.054  -0.666  -0.410   0.248   0.445  -0.355
Eco        -0.533   0.271  -0.526  -0.559   0.185  -0.140
Soc        -0.475   0.286  -0.248   0.771  -0.020   0.192
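The first principal component of the grade table can be approximated with a short power-iteration sketch on the sample covariance matrix. This is one possible way to obtain the leading eigenpair and is not necessarily how the reported results were produced; the function names, the use of the (n - 1) covariance, and the sign convention (positive Math loading) are assumptions made here.

```python
import math

# Grades of students A-J in Math, Chem, Phy, Bio, Eco, Soc
grades = [
    [7, 8, 7, 8, 7, 7],
    [8, 7, 7, 6, 8, 7],
    [9, 7, 8, 7, 6, 7],
    [7, 7, 7, 7, 9, 8],
    [7, 6, 6, 6, 8, 8],
    [7, 7, 7, 7, 8, 8],
    [6, 6, 6, 7, 7, 7],
    [9, 8, 8, 6, 6, 6],
    [8, 8, 8, 7, 6, 6],
    [7, 7, 6, 6, 8, 9],
]


def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of the columns of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iterations=500):
    """Leading eigenvalue and unit eigenvector of cov by power iteration."""
    p = len(cov)
    v = [1.0] + [0.0] * (p - 1)          # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for a unit vector v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    if v[0] < 0:                          # fix the sign so Math loads positively
        v = [-x for x in v]
    return eigenvalue, v


cov = covariance_matrix(grades)
total_variance = sum(cov[i][i] for i in range(len(cov)))
eigenvalue_1, loadings_1 = first_principal_component(cov)
proportion_1 = eigenvalue_1 / total_variance
```

With the table transcribed as above, `eigenvalue_1` and `proportion_1` should come out close to the 3.020 and 0.652 reported for PC1, and `loadings_1` close to the first eigenvector column.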

Page 34: Lecture  2 Data Types in computational biology/Systems biology Useful websites
Page 35: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Partial Least Squares (PLS)

• Unlike PCA, the PLS technique works by successively extracting factors from both predictive and target variables such that the covariance between the extracted factors is maximized.

• Decomposition step: $X = T W^{\top} + E$, $Y = U V^{\top} + F$

• Regression step: $Y = TB + D = XWB + D = X B_{PLS} + D$, where $B_{PLS} = WB$

Page 36: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Numerical Example

The following are the high school grades of 10 students in 6 subjects (scale 1-10), and the corresponding GPA scores at the undergraduate level:
• Math = Mathematics
• Chem = Chemistry
• Phy = Physics
• Bio = Biology
• Eco = Economy
• Soc = Sociology

Student  Math  Chem  Phy  Bio  Eco  Soc  GPA
A        7     8     7    8    7    7    2.9
B        8     7     7    6    8    7    3.1
C        9     7     8    7    6    7    3.6
D        7     7     7    7    9    8    3.3
E        7     6     6    6    8    8    3.0
F        7     7     7    7    8    8    2.9
G        6     6     6    7    7    7    3.2
H        9     8     8    6    6    6    3.4
I        8     8     8    7    6    6    2.8
J        7     7     6    6    8    9    3.5

Objective: Can we use information on students' performance during high school to predict their GPA when they enter the undergraduate level?
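A minimal sketch of the idea with a single extracted factor and a single response (often called PLS1) is shown below. The weight vector is taken proportional to the cross-product of the centered predictors with the centered response, which is the unit-length direction with the largest covariance with GPA. The function name, the choice of one factor, and the absence of scaling are assumptions made for illustration; a full PLS analysis would extract further factors from the deflated data.

```python
import math

# Grades (Math, Chem, Phy, Bio, Eco, Soc) and GPA of students A-J
X = [
    [7, 8, 7, 8, 7, 7], [8, 7, 7, 6, 8, 7], [9, 7, 8, 7, 6, 7],
    [7, 7, 7, 7, 9, 8], [7, 6, 6, 6, 8, 8], [7, 7, 7, 7, 8, 8],
    [6, 6, 6, 7, 7, 7], [9, 8, 8, 6, 6, 6], [8, 8, 8, 7, 6, 6],
    [7, 7, 6, 6, 8, 9],
]
y = [2.9, 3.1, 3.6, 3.3, 3.0, 2.9, 3.2, 3.4, 2.8, 3.5]


def pls1_one_component(X, y):
    """Fit a one-factor PLS regression of y on the columns of X."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]
    # Decomposition step: weights proportional to Xc' yc, scaled to unit length
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(value * value for value in w))
    w = [value / norm for value in w]
    # Factor scores t = Xc w (one score per student)
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regression step: least-squares slope of yc on t
    b = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    fitted = [y_mean + b * ti for ti in t]
    return w, t, b, fitted


weights, scores, slope, fitted_gpa = pls1_one_component(X, y)
residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted_gpa))
total_ss = sum((yi - sum(y) / len(y)) ** 2 for yi in y)
```

Comparing `residual_ss` with `total_ss` gives a rough sense of how much of the variation in GPA the first factor accounts for on these ten students.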

Page 37: Lecture  2 Data Types in computational biology/Systems biology Useful websites
Page 38: Lecture  2 Data Types in computational biology/Systems biology Useful websites

K-means clustering

Page 39: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Source: “Clustering Challenges in Biological Networks” edited by S. Butenko et al.

Page 40: Lecture  2 Data Types in computational biology/Systems biology Useful websites

Source: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean/

Page 41: Lecture  2 Data Types in computational biology/Systems biology Useful websites

1. Initial value of centroids: Suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).
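Starting from these initial centroids, the iterations of K-means can be sketched as follows. Only the coordinates of medicines A and B are given above; the coordinates used for two further medicines, C = (4, 3) and D = (5, 4), are an assumption made for this example, as are the function names.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break                      # converged: no point changed cluster
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


# A and B from the slide; C and D are assumed values for illustration
medicines = [(1, 1), (2, 1), (4, 3), (5, 4)]
assignment, centroids = kmeans(medicines, [(1, 1), (2, 1)])
# assignment is [0, 0, 1, 1]; centroids are (1.5, 1.0) and (4.5, 3.5)
```

With these four points the assignments stop changing at the third pass, grouping A with B and C with D.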
