MINING THE GENE EXPRESSION MATRIX: INFERRING GENE RELATIONSHIPS FROM LARGE SCALE GENE EXPRESSION DATA
Patrik D'haeseleer, Xiling Wen, Stefanie Fuhrman, and Roland Somogyi
Information Processing in Cells and Tissues, pp. 203-212, 1998
Presented by Bin He
Motivations
It is necessary to determine large-scale temporal gene expression patterns.
To decipher the logic of gene regulation, we should aim to monitor the expression levels of all genes simultaneously.
Gene time series
Assay the expression levels of large numbers of genes in a tissue at different time points.
The relative amounts of mRNA produced at these time points provide a gene expression time series for each gene.
Gene Expression Matrix
Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R., 1997, Large-scale temporal gene expression mapping of CNS development, Proc. Natl. Acad. Sci., in press
Previous Approach
Euclidean distance and information-theoretic measures were used to cluster the genes into related expression time series.
A significant problem with this approach is the variety of measures that can be used.
Each measure produces a unique clustering of gene expression patterns.
Contributions
Determining significant relationships between individual genes, based on: linear correlation, rank correlation, information theory
Linear correlation ------positive correlation
Positive linear correlation
Linear correlation ------negative correlation
Negative linear correlation
Linear correlation ------restriction
For 112 different genes, 112 x 111 / 2 = 6216 pairs of expression time series need to be examined.
To restrict the number of relationships, we might want to test which correlations are significantly larger than a certain value.
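The pairwise examination described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy series are hypothetical, and the series are assumed to be non-constant (a constant series would make the denominator zero).

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def all_pair_correlations(series):
    """Map each unordered pair of gene names to its correlation coefficient."""
    return {(g, h): pearson_r(series[g], series[h])
            for g, h in combinations(sorted(series), 2)}

# Hypothetical toy data, not taken from the paper.
toy = {
    "geneA": [0.1, 0.2, 0.4, 0.7, 0.9],
    "geneB": [0.2, 0.4, 0.8, 1.4, 1.8],  # 2 * geneA, positive correlation
    "geneC": [0.9, 0.8, 0.6, 0.3, 0.1],  # 1 - geneA, negative correlation
}
corr = all_pair_correlations(toy)

# Number of unordered pairs for 112 genes, as stated on the slide.
n_pairs_112 = 112 * 111 // 2
```

With n genes the dictionary holds n(n-1)/2 entries, which is where the 6216 pairs for 112 genes come from.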
Linear correlation ------restriction
For instance, to find those relationships in which at least 50% of the variance is explained by the correlation, i.e. rho^2 > 0.5, we need |r| > 0.96 to reject, at the 1% significance level, the null hypothesis that |rho| < 0.7071 (0.7071 = sqrt(0.5)).
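The slide does not say which test produces the 0.96 threshold. One plausible reconstruction, offered only as an assumption, is the Fisher z-transformation with a two-sided 1% critical value and n = 9 samples (the nine time points E11 through adult in the mapping table). The function name `min_significant_r` is hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def min_significant_r(rho0, n, alpha=0.01):
    """Smallest sample |r| that rejects |rho| <= rho0 under a Fisher z test.

    Assumption: two-sided critical value and standard error 1/sqrt(n - 3).
    The source does not state which test was actually used.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(atanh(rho0) + z_crit / sqrt(n - 3))

# rho^2 > 0.5 corresponds to |rho| > sqrt(0.5), about 0.7071.
threshold = min_significant_r(sqrt(0.5), n=9)
print(round(threshold, 3))
```

Under these assumptions the threshold comes out close to the 0.96 quoted on the slide. A one-sided critical value would give a slightly lower number.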
Linear correlation ------visualization
Residual-variance-based distance measurement: d = 1 - r^2.
d = 0 if perfectly correlated, d = 1 if uncorrelated.
Multidimensional scaling maps the time series into a two-dimensional plane.
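A small sketch of the distance d = 1 - r^2 follows. The data and function names are hypothetical. The multidimensional scaling step is not shown; it would take the matrix of these distances as input.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residual_variance_distance(x, y):
    """d = 1 - r^2: the fraction of variance not explained by the correlation."""
    r = pearson_r(x, y)
    return 1.0 - r * r

# Hypothetical toy series.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 8.0, 6.0, 4.0, 2.0]   # perfectly negatively correlated with a
c = [1.0, -1.0, 0.0, -1.0, 1.0]  # uncorrelated with a

d_ab = residual_variance_distance(a, b)  # close to 0
d_ac = residual_variance_distance(a, c)  # close to 1
```

Because r is squared, a perfect negative correlation yields the same distance as a perfect positive one, so anti-correlated genes end up close together.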
Linear correlation ------visualization
Multidimensional scaling of 34 time series with high correlation
Nonlinear correlation ------Model
Spearman rank correlation, r_s: a measurement for monotonic relationships that can also be used for non-Gaussian distributions.
491 pairs of expression time series, involving 98 genes, have a significant r_s, ranging from -0.979 to 0.996.
Nonlinear correlation ------Example
High rank correlation but low linear correlation between mGluR1 and GRa2
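The contrast between rank and linear correlation can be illustrated with a short sketch. The series below are hypothetical and are not the mGluR1 and GRa2 data. Spearman's r_s is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotonic but strongly non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 4.0, 8.0, 100.0]

r_linear = pearson_r(x, y)  # noticeably below 1
r_rank = spearman_rs(x, y)  # 1 for any strictly increasing relationship
```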
Information Theory ------mutual information
If H(A) and H(B) are the entropies of sources A and B respectively, and H(A,B) is the joint entropy of the sources, then the mutual information is M(A,B) = H(A) + H(B) - H(A,B).
The discrete form is much easier to use, so we need to discretize the time series by partitioning the expression levels into bins.
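For discretized series, the definition M(A,B) = H(A) + H(B) - H(A,B) can be sketched directly from observed bin frequencies. The function names and toy sequences are hypothetical, and entropies are measured in bits (base-2 logarithm), which is an assumption about units.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy in bits, estimated from observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B) for two aligned discrete series."""
    joint = list(zip(a, b))  # each time point becomes one joint symbol
    return entropy(a) + entropy(b) - entropy(joint)

# Hypothetical toy sequences.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]  # independent of a over these four points

m_ab = mutual_information(a, b)  # 0 bits: b tells us nothing about a
m_aa = mutual_information(a, a)  # equals H(a), here 1 bit
```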
Information Theory ------Bin size
The fewer bins we use to discretize the data, the more information about the original time series we ignore.
On the other hand, too fine a binning will leave us with too few points per bin to get a reasonable estimate of the frequency of each bin.
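The slides do not specify how the bins are chosen. As an illustration only, the sketch below assumes equal-width bins between each series' minimum and maximum, with three bins to match the 0/1/2 levels in the mapping table. The function name and the toy series are hypothetical.

```python
def discretize(series, n_bins=3):
    """Map each value to a bin index 0 .. n_bins-1 using equal-width bins.

    Assumption: bins span [min, max] of the series itself; the source does
    not state the binning scheme actually used.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)  # constant series: everything in one bin
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in series]

# Hypothetical normalized expression levels.
levels = [0.0, 0.1, 0.5, 0.9, 1.0]
binned = discretize(levels)
print(binned)  # [0, 0, 1, 2, 2]
```

Raising `n_bins` keeps more detail but, with only a handful of time points per gene, quickly leaves most bins with one or zero observations.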
Information Theory ------Mapping
Some time series map to the same discretized series.
In total, from 112 unique continuous-valued time series we get 91 discretized time series.
Information Theory ------Mapping
E11  E13  E15  E18  E21  P0   P7   P14  A    genes
0    0    2    2    2    2    2    2    2    MAP2, pre-GAD67, GAT1
0    0    0    0    0    0    0    1    2    NFM, mGluR1, NMDA2A
0    0    0    1    1    1    1    2    2    S100 beta, GRg1
0    0    0    2    2    2    2    1    1    GAD67, mGluR5, NMDA1
Information Theory ------Mapping
Eliminate one-to-one mappings: some discretized series can be obtained from one another by permuting the bin numbers, in which case H(A) = H(B) = M(A,B) (e.g. rows 3 and 4 of the table).
Replace such time series by one single series, leaving us with a set of 77 unique, non-equivalent time series.
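One way to detect series that differ only by a permutation of bin numbers is to relabel the bins in order of first appearance and compare the results. This is a sketch of that idea, not necessarily the procedure used by the authors. The function name `canonical` is hypothetical; the four series are the rows of the mapping table.

```python
def canonical(series):
    """Relabel bins in order of first appearance (first bin seen becomes 0)."""
    relabel = {}
    return tuple(relabel.setdefault(s, len(relabel)) for s in series)

# The four discretized series from the mapping table (E11 .. A).
rows = [
    [0, 0, 2, 2, 2, 2, 2, 2, 2],  # MAP2, pre-GAD67, GAT1
    [0, 0, 0, 0, 0, 0, 0, 1, 2],  # NFM, mGluR1, NMDA2A
    [0, 0, 0, 1, 1, 1, 1, 2, 2],  # S100 beta, GRg1
    [0, 0, 0, 2, 2, 2, 2, 1, 1],  # GAD67, mGluR5, NMDA1
]

# Rows 3 and 4 differ only by swapping bins 1 and 2, so they coincide.
same_3_4 = canonical(rows[2]) == canonical(rows[3])
unique_series = {canonical(r) for r in rows}
print(same_3_4, len(unique_series))  # True 3
```

Two series with the same canonical form carry exactly the same information about each other, which is why they can be merged before computing mutual information.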
Information Theory ------Measurement
Symmetric measures: M(A,B)/max(H(A),H(B)) and M(A,B)/H(A,B).
Asymmetric measure: relative mutual information, R(A,B) = M(A,B)/H(B).
R(A,B) = 1.0 means that all the information about time series B is contained in time series A.
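The three normalized measures can be compared on a small example. The sequences and function names are hypothetical; entropies are in bits, and H(B) is assumed to be non-zero, since a constant series would make R(A,B) undefined.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """Shannon entropy in bits, estimated from observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

def relative_mutual_information(a, b):
    """Asymmetric measure R(A,B) = M(A,B) / H(B)."""
    return mutual_information(a, b) / entropy(b)

# Hypothetical series: b is fully determined by a, but not the reverse.
a = [0, 1, 2, 3]
b = [0, 0, 1, 1]

m = mutual_information(a, b)
sym_max = m / max(entropy(a), entropy(b))   # M / max(H(A), H(B))
sym_joint = m / entropy(list(zip(a, b)))    # M / H(A,B)
r_ab = relative_mutual_information(a, b)    # 1.0: a contains all of b
r_ba = relative_mutual_information(b, a)    # 0.5: b contains half of a
```

The asymmetry is the point of R(A,B): it distinguishes "A determines B" from "B determines A", which the symmetric measures cannot do.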
Conclusion
Linear correlation can be used very effectively to detect linear relationships. It can detect relationships not captured by Euclidean distance, such as high negative correlations.
Rank correlation can be used to detect non-linear relationships. It is much more robust with respect to the distribution of expression levels.
Information theory can be used to detect genes whose (binned) expression patterns share information. It will detect any mapping from time series A to B.