MINING THE GENE EXPRESSION MATRIX: INFERRING GENE RELATIONSHIPS FROM LARGE SCALE GENE EXPRESSION DATA

Patrik D'haeseleer, Xiling Wen, Stefanie Fuhrman, and Roland Somogyi

Information Processing in Cells and Tissues, pp. 203-212, 1998

Presented by Bin He

Motivations

It is necessary to determine large-scale temporal gene expression patterns.

To decipher the logic of gene regulation, we should aim to monitor the expression levels of all genes simultaneously.

Gene time series

Assay the expression levels of large numbers of genes in a tissue at different time points.

Gene time series

The relative amounts of mRNA produced at these time points provide a gene expression time series for each gene.

Gene Expression Matrix

Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R., 1997, Large-scale temporal gene expression mapping of CNS development, Proc. Natl. Acad. Sci., in press

Previous Approach

Euclidean distance and information theoretic measures were used to cluster the genes into related expression time series.

A significant problem with this approach is the variety of measures that can be used.

Each measure produces its own, different clustering of gene expression patterns.

Contributions

Determining significant relationships between individual genes, based on linear correlation, rank correlation, and information theory.

Linear correlation ------ positive correlation

[Figure: positive linear correlation]

Linear correlation ------ negative correlation

[Figure: negative linear correlation]
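As a rough illustration of positive and negative linear correlation, the sketch below computes the Pearson coefficient r for pairs of series. The function name `pearson_r` and all expression values are hypothetical and are not taken from the paper's data.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression levels at 9 time points (illustrative only).
gene_a = [0.1, 0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 1.0, 1.0]
gene_b = [v + 0.1 for v in gene_a]   # rises together with gene_a
gene_c = [1.0 - v for v in gene_a]   # falls as gene_a rises

r_pos = pearson_r(gene_a, gene_b)    # close to +1: positive correlation
r_neg = pearson_r(gene_a, gene_c)    # close to -1: negative correlation
```

Note that `gene_a` and `gene_c` are far apart in Euclidean distance even though they are perfectly (negatively) correlated.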

Linear correlation ------ restriction

For 112 different genes, 112 x 111 / 2 = 6216 pairs of expression time series need to be examined.

To restrict the number of relationships, we might want to test which correlations are significantly larger than a certain value.

Linear correlation ------ restriction

For instance, to find those relationships in which at least 50% of the variance is explained by the correlation, i.e. rho^2 > 0.5, we need |r| > 0.96 to reject, at the 1% significance level, the null hypothesis that |rho| < 0.7071.
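The arithmetic on these two slides can be checked with a short sketch. The slides do not say which test was used; the code below assumes Fisher's z-transformation with a two-sided critical value and 9 time points per series (the number of columns in the discretization table later in the deck). The helper name `critical_r` is hypothetical.

```python
import math
from statistics import NormalDist

n_genes = 112
n_pairs = math.comb(n_genes, 2)  # number of unordered gene pairs, 112 * 111 / 2

def critical_r(rho0, n_points, alpha):
    """Smallest |r| that rejects the null hypothesis |rho| <= rho0.

    Assumption: Fisher z-transform, where atanh(r) is approximately normal
    with standard error 1 / sqrt(n_points - 3), and a two-sided level alpha.
    """
    z0 = math.atanh(rho0)
    std_err = 1.0 / math.sqrt(n_points - 3)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.tanh(z0 + z_crit * std_err)

# rho^2 > 0.5 corresponds to |rho| > sqrt(0.5), about 0.7071
threshold = critical_r(math.sqrt(0.5), n_points=9, alpha=0.01)
```

Under these assumptions the threshold evaluates to roughly 0.96, in line with the value quoted on the slide.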

Linear correlation ------ visualization

Residual-variance-based distance measure: d = 1 - r^2

d = 0 if perfectly correlated, d = 1 if uncorrelated

Multidimensional scaling maps the time series into a two-dimensional plane
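A minimal sketch of the distance d = 1 - r^2, assuming Pearson's r and using made-up series. The resulting pairwise distance matrix is the kind of input a multidimensional scaling routine would take; the scaling step itself is not shown here.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def residual_variance_distance(x, y):
    """d = 1 - r^2: 0 for perfect (positive or negative) correlation, 1 for none."""
    r = pearson_r(x, y)
    return 1.0 - r * r

# Hypothetical series (illustrative only, not from the paper).
rising = [1, 2, 3, 4, 5, 6, 7, 8, 9]
falling = list(reversed(rising))          # perfectly anti-correlated with rising
v_shape = [(v - 5) ** 2 for v in rising]  # symmetric, so uncorrelated with rising

series = {"rising": rising, "falling": falling, "v_shape": v_shape}
names = list(series)
# Symmetric matrix of pairwise distances, rows and columns in the order of names.
dist = [[residual_variance_distance(series[a], series[b]) for b in names]
        for a in names]

d_anti = residual_variance_distance(rising, falling)   # about 0
d_none = residual_variance_distance(rising, v_shape)   # about 1
```

Because r is squared, strongly negative correlations end up as close together as strongly positive ones.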

Linear correlation ------ visualization

[Figure: multidimensional scaling of 34 time series with high correlation]

Nonlinear correlation ------ Model

Spearman rank correlation, rs

A measure of monotonic relationships that can also be used for non-Gaussian distributions

491 pairs of expression time series, involving 98 genes, have a significant rs, ranging from -0.979 to 0.996
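The sketch below shows one common way to compute a Spearman rank correlation: replace each value by its rank (tied values share the average rank) and take the Pearson correlation of the ranks. The function names and the example data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but strongly non-linear relationship.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [math.exp(v) for v in x]

rs = spearman_rs(x, y)  # the ranks agree perfectly
r = pearson_r(x, y)     # noticeably lower linear correlation
```

This mirrors the situation of a high rank correlation combined with a low linear correlation.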

Nonlinear correlation ------ Example

High rank correlation but low linear correlation between mGluR1 and GRa2

Information Theory ------ mutual information

If H(A) and H(B) are the entropies of sources A and B respectively, and H(A,B) is the joint entropy of the sources, then the mutual information is M(A,B) = H(A) + H(B) - H(A,B)

The discrete form is much easier to use

We need to discretize the time series by partitioning the expression levels into bins
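A minimal sketch of the discrete computation, assuming equal-width bins and entropies measured in bits. The slides do not specify the binning scheme, so `discretize` is an illustrative choice, and the expression values are made up.

```python
import math
from collections import Counter

def discretize(series, n_bins):
    """Map each value to an equal-width bin index in 0..n_bins-1 (assumed scheme)."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(series)  # constant series: everything in one bin
    return [min(int((v - lo) / width), n_bins - 1) for v in series]

def entropy(symbols):
    """Shannon entropy in bits, estimated from symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B), with H(A,B) taken over paired symbols."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Hypothetical normalized expression levels at 9 time points.
gene_x = [0.0, 0.1, 0.2, 0.5, 0.6, 0.9, 1.0, 1.0, 0.95]
gene_y = [0.9, 1.0, 0.8, 0.4, 0.5, 0.1, 0.0, 0.05, 0.1]

bins_x = discretize(gene_x, 3)
bins_y = discretize(gene_y, 3)
m_xy = mutual_information(bins_x, bins_y)
```

In this example the binned series for `gene_y` is a relabeling of the one for `gene_x`, so the mutual information equals the entropy of either series.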

Information Theory ------ Bin size

The fewer bins we use to discretize the data, the more information about the original time series we ignore.

On the other hand, too fine a binning will leave us with too few points per bin to get a reasonable estimate of the frequency of each bin.

Information Theory ------ Mapping

Some time series map to the same discretized series

In total, from 112 unique continuous-valued time series we get 91 discretized time series

Information Theory ------ Mapping

E11  E13  E15  E18  E21  P0   P7   P14  A    genes
0    0    2    2    2    2    2    2    2    MAP2, pre-GAD67, GAT1
0    0    0    0    0    0    0    1    2    NFM, mGluR1, NMDA2A
0    0    0    1    1    1    1    2    2    S100 beta, GRg1
0    0    0    2    2    2    2    1    1    GAD67, mGluR5, NMDA1

Information Theory ------ Mapping

Eliminate series that are one-to-one mappings of each other under a permutation of the bin numbers. For such series H(A) = H(B) = M(A,B), e.g. row 3 and row 4 of the table.

Replace such time series by one single series, leaving us with a set of 77 unique, non-equivalent time series.
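One way to detect series that differ only by a permutation of the bin numbers is to relabel the bins in order of first appearance and compare the results. This is a sketch of the idea, not necessarily the procedure the authors used; `canonical_form` is a hypothetical helper, and the rows are the four discretized series from the table.

```python
def canonical_form(symbols):
    """Relabel bins in order of first appearance.

    Two series that are one-to-one mappings of each other under a
    permutation of bin numbers get the same canonical form.
    """
    relabel = {}
    out = []
    for s in symbols:
        if s not in relabel:
            relabel[s] = len(relabel)  # next unused label
        out.append(relabel[s])
    return tuple(out)

rows = [
    [0, 0, 2, 2, 2, 2, 2, 2, 2],  # row 1: MAP2, pre-GAD67, GAT1
    [0, 0, 0, 0, 0, 0, 0, 1, 2],  # row 2: NFM, mGluR1, NMDA2A
    [0, 0, 0, 1, 1, 1, 1, 2, 2],  # row 3: S100 beta, GRg1
    [0, 0, 0, 2, 2, 2, 2, 1, 1],  # row 4: GAD67, mGluR5, NMDA1
]

same_3_4 = canonical_form(rows[2]) == canonical_form(rows[3])
# Number of non-equivalent series among the four rows.
n_classes = len({canonical_form(r) for r in rows})
```

Rows 3 and 4 collapse to the same canonical form (swap bins 1 and 2), so the four rows reduce to three non-equivalent series.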

Information Theory ------ Measurement

Symmetric measures: M(A,B)/max(H(A),H(B)) and M(A,B)/H(A,B)

Asymmetric measure: relative mutual information, R(A,B) = M(A,B)/H(B)

R(A,B) = 1.0 means that all the information about time series B is contained in time series A
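The asymmetry of R(A,B) can be seen in a small sketch. Here series B is a hypothetical coarsening of series A (bins 1 and 2 merged), so A determines B but not the other way round. Entropies are in bits, and all names and values are illustrative.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits, estimated from symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def relative_mutual_information(a, b):
    """R(A,B) = M(A,B) / H(B); undefined when b is constant (H(B) = 0)."""
    return mutual_information(a, b) / entropy(b)

series_a = [0, 0, 0, 1, 1, 1, 1, 2, 2]
series_b = [0, 0, 0, 1, 1, 1, 1, 1, 1]  # hypothetical: bins 1 and 2 of A merged

r_ab = relative_mutual_information(series_a, series_b)  # about 1.0: A determines B
r_ba = relative_mutual_information(series_b, series_a)  # below 1: B does not determine A

# The two symmetric measures, for comparison.
m = mutual_information(series_a, series_b)
sym_max = m / max(entropy(series_a), entropy(series_b))
sym_joint = m / entropy(list(zip(series_a, series_b)))
```

The symmetric measures give the same value whichever series is listed first, while R(A,B) and R(B,A) differ and so indicate the direction of the mapping.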

Conclusion

Linear correlation can be used very effectively to detect linear relationships. It detects relationships not captured by Euclidean distance, such as high negative correlations.

Rank correlation can be used to detect non-linear relationships. It is much more robust with respect to the distribution of expression levels.

Information theory can be used to detect genes whose (binned) expression patterns share information. It will detect any mapping from time series A to B.
