Clustering of Gene Expression Time Series with Conditional Random Fields Yinyin Yuan and Chang-Tsun Li Computer Science Department


Clustering of Gene Expression Time Series with Conditional Random Fields

Yinyin Yuan and Chang-Tsun Li
Computer Science Department


Microarray and Gene Expression

• A microarray is a high-throughput technique that can assay the expression levels of a large number of genes in a tissue.

• A gene expression level is the relative amount of mRNA produced at a specific time point and under certain experimental conditions.

• Thus microarrays provide a means to decipher the logic of gene regulation by monitoring the expression of all genes in a tissue.


Gene Expression

• Gene expression data are obtained from microarrays and organized into a gene expression matrix for analysis with various methodologies for medical and biological purposes.


Gene Expression Time Series

• A sequence of gene expression levels measured at successive time points, at either uniform or uneven time intervals.

• Reveals more information than static data, as time series data have strong correlations between successive points.

Time Series Clustering

• Assumption: co-expression indicates co-regulation, thus clustering identifies genes that share similar functions.


Probabilistic models

A key challenge of gene expression time series research is the development of efficient and reliable probabilistic models that:

• Allow measurement of uncertainty

• Give an analytical measure of confidence in the clustering result

• Indicate the significance of a data point

• Reflect temporal dependencies in the data points


Goal

• Identify highly informative genes

• Cluster genes in the dataset

• GO (Gene Ontology) analysis of biological function for each cluster.


HMMs and CRFs

• HMMs vs. CRFs

• HMMs are trained to maximize the joint probability of a set of observed data and their corresponding labels.

• Independence assumptions are needed for HMMs to be computationally tractable.

• Representing long-range dependencies between genes, and gene interactions, is therefore computationally intractable.


Conditional Random Fields

• CRFs are undirected graphical models that define a probability distribution over the label sequences, globally conditioned on a set of observed features.

– X = {x1, x2, …, xn}: variable over the observations;
– Y = {y1, y2, …, yn}: variable over the corresponding labels;
– Observed data xj and class labels yj for all j in a voting pool Ni for sample xi.


CRFs Model

• The CRFs model can be expressed in a Gibbs form in terms of cost functions

• The CRFs model can be formulated as follows


Cost function

• The conditional random field model can also be expressed in a Gibbs form in terms of cost functions

• Cost function
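As a reminder of what a Gibbs form looks like in general, the following is a generic sketch and not the authors' specific model. The local cost $U_i$, the normalizer $Z_i$, and the temperature $T$ are illustrative symbols assumed here; $N_i$ is the voting pool of sample $x_i$.

```latex
P(y_i \mid x_i, x_{N_i}, y_{N_i})
  = \frac{1}{Z_i} \exp\!\left( -\frac{U_i(y_i \mid x_i, x_{N_i}, y_{N_i})}{T} \right),
\qquad
Z_i = \sum_{y} \exp\!\left( -\frac{U_i(y \mid x_i, x_{N_i}, y_{N_i})}{T} \right)
```

In any model of this shape, a lower cost corresponds to a higher conditional probability.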


Potential function

• Real-valued potential functions are obtained and used to form the cost function

• D: the estimated threshold that divides the set of Euclidean distances into intra-class and inter-class distances


Finding the optimal labels

• We adopt deterministic label selection: the optimal label is determined by
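A minimal sketch of such a rule, assuming a Gibbs-form model $P \propto \exp(-U_i)$ with a hypothetical local cost $U_i$ and a label set $\mathcal{L}$ (both symbols are assumptions made here for illustration): because the exponential is monotone and the normalizer does not depend on the candidate label, the most probable label is the one with the lowest cost.

```latex
\hat{y}_i
  = \arg\max_{y \in \mathcal{L}} P(y \mid x_i, x_{N_i}, y_{N_i})
  = \arg\min_{y \in \mathcal{L}} U_i(y \mid x_i, x_{N_i}, y_{N_i})
```

Deterministic here means the minimizer is taken directly, rather than sampling a label from the distribution.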


Pre-processing

• Linear warping for data alignment

• τ-time-point data are transformed into a (τ-1)-dimensional feature space: differences between consecutive time points, scaled inversely by the time intervals, are used as features because they reflect the temporal structure of the time series.

• Voting pool: keeps the most similar sample, the most different sample, and k-2 randomly selected samples.
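The difference features from the pre-processing step can be sketched as below. This is a minimal illustration under one reading of the slide, namely that each feature is the change between consecutive measurements divided by the elapsed time. The function name `difference_features` and its arguments are hypothetical.

```python
def difference_features(values, times):
    """Turn a tau-point series into tau-1 features.

    Assumption: each feature is the difference between consecutive
    measurements divided by the time interval between them.
    """
    if len(values) != len(times):
        raise ValueError("values and times must have the same length")
    features = []
    for k in range(1, len(values)):
        dt = times[k] - times[k - 1]
        if dt <= 0:
            raise ValueError("time points must be strictly increasing")
        features.append((values[k] - values[k - 1]) / dt)
    return features


# Three measurements at uneven time points yield two features.
print(difference_features([1.0, 3.0, 6.0], [0, 1, 4]))
```

Dividing by the interval keeps unevenly sampled series comparable, since a given change over a long gap counts for less than the same change over a short gap.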


Process

• Initialization
– Each sample is assigned a random label
– Voting pools are formed randomly

• Samples interact with each other progressively via their voting pools
– Update labels
– Update voting pools

• Repeat until steady
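The loop above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the cost in `label_cost` is a stand-in (each pool member sharing the candidate label contributes `D - distance`), the threshold `D` is passed in rather than estimated, the number of labels `n_labels` is fixed in advance, and all function names are hypothetical.

```python
import math
import random


def label_cost(i, y, pool, labels, samples, D):
    """Stand-in cost of giving sample i the label y (lower is better).

    Each pool member j carrying label y contributes D - d_ij, which is
    positive when j is closer than the threshold D and negative otherwise.
    The cost is the negated sum of these contributions.
    """
    return -sum(D - math.dist(samples[i], samples[j])
                for j in pool if labels[j] == y)


def update_pool(i, pool, samples, k, rng):
    """Keep the nearest and farthest pool members, redraw the rest at random."""
    nearest = min(pool, key=lambda j: math.dist(samples[i], samples[j]))
    farthest = max(pool, key=lambda j: math.dist(samples[i], samples[j]))
    kept = [nearest] if nearest == farthest else [nearest, farthest]
    others = [j for j in range(len(samples)) if j != i and j not in kept]
    return kept + rng.sample(others, k - len(kept))


def cluster(samples, n_labels, k, D, max_sweeps=50, seed=0):
    """Alternate label and voting-pool updates until no label changes."""
    n = len(samples)
    if not 2 <= k <= n - 1:
        raise ValueError("need 2 <= k <= len(samples) - 1")
    rng = random.Random(seed)
    # Initialization: random labels and random voting pools.
    labels = [rng.randrange(n_labels) for _ in range(n)]
    pools = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            # Deterministic selection: the lowest cost wins. The current
            # label is listed first so that ties do not cause a switch.
            candidates = [labels[i]] + [y for y in range(n_labels) if y != labels[i]]
            best = min(candidates,
                       key=lambda y: label_cost(i, y, pools[i], labels, samples, D))
            if best != labels[i]:
                labels[i] = best
                changed = True
            pools[i] = update_pool(i, pools[i], samples, k, rng)
        if not changed:
            break
    return labels


samples = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
print(cluster(samples, n_labels=3, k=2, D=5.0))
```

"Steady" is interpreted here as one full sweep with no label change; because the pools are redrawn at random, a stricter stopping rule may be preferable in practice.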


Experimental Validation

• Both a biological dataset and a simulated dataset

• Adjusted Rand index: a similarity measure between two partitions

• Yeast galactose dataset
– Gene expression measurements in galactose utilization in Saccharomyces cerevisiae
– Subset of measurements of 205 genes whose expression patterns reflect four functional categories in the Gene Ontology (GO) listings
– 4 repeated measurements across 20 time points
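For reference, the adjusted Rand index between two partitions can be computed from pair counts, as in the sketch below. It uses the standard Hubert and Arabie formulation; the function name `adjusted_rand_index` is chosen here for illustration and the code is not taken from the slides.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree
    # Pairs placed together in both partitions (contingency-table cells).
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both partitions are all-in-one or all-singletons
    return (together_both - expected) / (maximum - expected)


# Label names do not matter, only the grouping does.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A value of 1 means the two partitions are identical up to renaming of labels, and values near 0 are what random labelings would produce on average.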


Results for Yeast galactose dataset

• The four functional categories of the Yeast galactose dataset

[Figure: Experimental results on the Yeast galactose dataset]

We obtained an average Rand index value of 0.943 over 10 experiments, higher than the 0.7 reported in Tjaden et al. 2006.


Simulated Dataset

• Data are generated for 400 genes across 20 time points from six artificial patterns to model periodic, up-regulated, and down-regulated gene expression profiles.

• High Gaussian noise is added.

• Perfect partitions are obtained with 10 iterations
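A dataset of this shape could be generated along the following lines. The six pattern functions and the noise level `noise_sd` are placeholders chosen for illustration, since the slides do not give the actual patterns or noise variance; `make_patterns` and `simulate` are hypothetical names.

```python
import math
import random


def make_patterns(n_points=20):
    """Six illustrative profiles: two periodic, two rising, two falling."""
    t = [k / (n_points - 1) for k in range(n_points)]
    return [
        [math.sin(2 * math.pi * x) for x in t],  # periodic
        [math.cos(2 * math.pi * x) for x in t],  # periodic, phase-shifted
        [x for x in t],                          # linear up-regulation
        [1 - x for x in t],                      # linear down-regulation
        [1 - math.exp(-3 * x) for x in t],       # saturating up-regulation
        [math.exp(-3 * x) for x in t],           # decaying down-regulation
    ]


def simulate(n_genes=400, n_points=20, noise_sd=0.5, seed=0):
    """Return (data, truth): noisy profiles and the pattern index of each gene."""
    rng = random.Random(seed)
    patterns = make_patterns(n_points)
    data, truth = [], []
    for g in range(n_genes):
        c = g % len(patterns)  # cycle through the patterns
        data.append([v + rng.gauss(0.0, noise_sd) for v in patterns[c]])
        truth.append(c)
    return data, truth


data, truth = simulate()
print(len(data), len(data[0]), sorted(set(truth)))
```

The returned `truth` labels make it possible to score a clustering of `data` against the generating patterns.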


Conclusions

• A novel unsupervised Conditional Random Fields model for efficient and accurate gene expression time series clustering

• All data points are randomly initialized

• The randomness of the voting pool facilitates global interactions


Future work

• Various similarity measures

• Taking advantage of information from repeated measurements

• Training and testing procedures