Predicting Gene Expression from Sequence

Michael A. Beer and Saeed Tavazoie

Cell 117, 185-198 (16 April 2004)

The Authors

Saeed Tavazoie (middle)
Professor
Dept. of Molecular Biology

Mike Beer
Postdoctoral Researcher
Ph.D., Princeton (1995)

The Question

• Transcription factor binding sites are relatively well-characterized in Saccharomyces cerevisiae

• However, the presence of a TF binding site alone is not sufficient to predict a gene's expression

• Multiple regulatory factors are often involved

• How do you identify the elaborate rules for gene regulation?

Simple regulatory structures

Each possible combination of TFs must be tested in the lab, which is a hugely time-consuming task.

Problems with predicting gene regulation

Numerous transcription factors can bind to any one motif

Regulatory motif sequences have low consensus

e.g. the well-known “TATA box” has a consensus of TATA(A/T)A(A/T)(A/G)

Many genes have multiple known motifs upstream of ATG
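To make the low-consensus point concrete, here is a minimal sketch that turns the TATA box consensus quoted above into a regular expression and scans a sequence for it. The promoter fragment is made up for illustration, and the function name `find_tata_boxes` is hypothetical.

```python
import re

# TATA(A/T)A(A/T)(A/G) written as a regex; the lookahead lets
# overlapping matches be reported as well.
TATA_PATTERN = re.compile(r"(?=TATA[AT]A[AT][AG])")

def find_tata_boxes(sequence):
    """Return the 0-based start position of every consensus match."""
    return [m.start() for m in TATA_PATTERN.finditer(sequence.upper())]

# Hypothetical promoter fragment, invented for this example.
promoter = "GGCTATAAAAGGCCTATATATACG"
positions = find_tata_boxes(promoter)
print(positions)
print([promoter[p:p + 8] for p in positions])
```

The three ambiguous positions mean that 2 × 2 × 2 = 8 distinct 8-mers all count as a "TATA box", which is part of why motif presence alone is a weak predictor.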

Example of cis-regulatory logic

From Yuh et al (1998), Science 279, 1896-1902

The Approach

1. Using microarray expression data, the authors built clusters of genes with similar expression patterns.

From brain expression data in Wen et al (1998), PNAS 95, 334-339
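Step 1 can be illustrated with a toy grouping of expression profiles by Pearson correlation. This is only a sketch of the idea of "similar expression patterns": the greedy seed-based grouping, the 0.9 threshold, the gene names, and the profile values are all assumptions made for the example, not the clustering procedure used in the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first cluster whose seed
    (first member) it correlates with, else it starts a new cluster."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters

# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [0.1, 1.0, 2.0, 1.1],
    "geneB": [0.2, 1.1, 2.1, 0.9],
    "geneC": [2.0, 1.0, 0.1, 1.2],
    "geneD": [1.9, 0.8, 0.0, 1.0],
}
print(cluster_by_correlation(profiles))
```

Correlation is used rather than raw distance so that genes with the same shape of response group together even if their absolute levels differ.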

The Approach, cont.

2. Within each cluster of genes with similar expression patterns, the authors searched for consensus sequence motifs in the 800 bp upstream of the ATG.
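As a rough illustration of step 2, the sketch below extracts the region upstream of a start codon and lists the k-mers present in every upstream region of a cluster. Real motif finders use probabilistic models and a background distribution; this exact-match k-mer count is a deliberately simplified stand-in, it looks at one strand only, and the sequences and function names are hypothetical.

```python
from collections import Counter

def upstream_region(chromosome, atg_index, length=800):
    """Return up to `length` bases immediately before the ATG."""
    start = max(0, atg_index - length)
    return chromosome[start:atg_index]

def shared_kmers(regions, k=6, min_fraction=1.0):
    """k-mers found in at least `min_fraction` of the regions."""
    counts = Counter()
    for region in regions:
        # Use a set so each region contributes a k-mer at most once.
        kmers = {region[i:i + k] for i in range(len(region) - k + 1)}
        counts.update(kmers)
    needed = min_fraction * len(regions)
    return sorted(kmer for kmer, n in counts.items() if n >= needed)

# Hypothetical, very short "upstream regions" for one cluster.
regions = ["AAACGCGTTT", "CCACGCGTGG", "TTACGCGTAA"]
print(shared_kmers(regions, k=6))
print(upstream_region("ACGTTATGCC", atg_index=5, length=3))
```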

The Approach, cont.

3. The authors built a Bayesian network using the TF sequence motifs as parent nodes, and the expression data as data values.

4. This can be applied to a gene of interest by identifying the upstream TF motifs for that gene, and finding the model(s) that best fits the known upstream TF motifs.

5. If the gene's measured expression falls within the range predicted by the model, then there is a decent chance that the associated gene regulatory structure can be verified experimentally.
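The idea in steps 3 to 5 of motifs acting as parent nodes of an expression variable can be sketched as a single conditional probability table: given which motifs are present upstream, how likely is the gene to follow a given expression pattern? The sketch below is a toy, assuming two hypothetical motifs, invented training counts, and Laplace smoothing; it is not the authors' model or learning procedure.

```python
from collections import defaultdict

def fit_cpt(training):
    """Estimate P(in_cluster | motif presence) from labelled genes.

    training: iterable of (presence_tuple, in_cluster) pairs, where
    presence_tuple holds 0/1 flags for each parent motif.
    """
    counts = defaultdict(lambda: [0, 0])  # [not in cluster, in cluster]
    for parents, in_cluster in training:
        counts[parents][int(in_cluster)] += 1
    # Laplace smoothing (add one to each outcome) avoids 0 and 1.
    return {p: (c[1] + 1) / (c[0] + c[1] + 2) for p, c in counts.items()}

def predict(cpt, parents, default=0.5):
    """Probability of cluster membership; `default` for unseen combos."""
    return cpt.get(parents, default)

# Hypothetical training genes: (motif_1 present, motif_2 present).
training = (
    [((1, 1), True)] * 3 + [((1, 1), False)] * 1
    + [((1, 0), False)] * 2
    + [((0, 1), False)] * 2
    + [((0, 0), False)] * 4
)
cpt = fit_cpt(training)
print(predict(cpt, (1, 1)), predict(cpt, (1, 0)))
```

In this toy data neither motif alone is associated with the pattern, but the combination is, which is the kind of combinatorial rule the slides describe.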

Two examples from yeast

Both clusters have at least 10 genes each, and there is some confidence that genes with the same upstream TFs will exhibit the same expression pattern as these clusters.

Constructing the models

Using expression data from 30 microarrays, the authors identified 5547 genes with “significant” expression levels in yeast, and this data was used to construct 49 models of expression patterns.

Predictive accuracy

These 49 models were applied to five test sets of expression data, using only the 800 bp upstream region as input.

The expression pattern was correctly predicted for 1898 of the 2587 genes in the test sets.

This amounts to 73% accuracy (random assignment would give 1/49, or about 2%).
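The accuracy figures on this slide can be checked with a few lines of arithmetic. The numbers are taken directly from the slide; the comparison against chance assumes each of the 49 patterns is equally likely under random assignment.

```python
correct, total, n_patterns = 1898, 2587, 49

accuracy = correct / total   # fraction of test genes predicted correctly
chance = 1 / n_patterns      # uniform random guess among 49 patterns
fold_over_chance = accuracy / chance

print(f"accuracy: {accuracy:.1%}")
print(f"chance:   {chance:.1%}")
print(f"improvement over chance: {fold_over_chance:.0f}-fold")
```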

Application to C. elegans

Given the larger amount of regulatory sequence in higher-order organisms, and the potential for more complex regulation, the authors had low expectations for applying this model to C. elegans.

Using 2000 bp of upstream sequence, and microarray expression data including Hill (2000), the authors were surprised to find that they could predict expression patterns for roughly half of the genes in the C. elegans dataset.

An example from C. elegans

Is it really so simple?

Gene regulation involves a complex combinatorial dance of numerous factors aside from the presence or absence of TF binding sites.

The authors deliberately limited their scope to cis-acting upstream elements, ignoring regulatory elements in introns or downstream regions, as well as the effects of operons, alternative splicing, histone modifications, methylation, and so on.

Model constraints

Several bits of information were found to be significant factors in improving the predictive accuracy of the models:

A. Motif orientation ( <--- or ---> )
B. Distance from the start codon
C. The particular order of various TFs
D. The presence of multiple copies of the same TF

All of those factors were included in the model as priors.
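The four factors listed above can all be read off a list of motif hits in the upstream region. The sketch below is one possible way to extract them, assuming exact string matching and an upstream sequence that ends immediately before the ATG; the motif, the sequence, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def motif_hits(upstream, motif):
    """Return (distance_to_ATG, orientation) for each motif copy.

    `upstream` is assumed to end immediately before the start codon.
    Hits are sorted farthest-upstream first, which gives the order.
    """
    candidates = [(motif, "+")]
    rc = reverse_complement(motif)
    if rc != motif:  # a palindromic motif has no distinct orientation
        candidates.append((rc, "-"))
    hits = []
    for pattern, strand in candidates:
        start = upstream.find(pattern)
        while start != -1:
            hits.append((len(upstream) - start, strand))
            start = upstream.find(pattern, start + 1)
    return sorted(hits, reverse=True)

# Hypothetical upstream sequence with three copies of a made-up motif.
upstream = "TTGATAAGCCCCTTATCAAGATAAGTT"
hits = motif_hits(upstream, "GATAAG")
print(hits)                 # orientation and distance of each copy
print(len(hits), "copies")  # copy number
```

Running the same function for several motifs and merging the sorted hit lists would give the relative order of different TF sites along the promoter.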

Why is distance from the start codon significant?

From Harbison et al (2004), Nature 431, 99-104

The number of copies of a TF binding site is relevant.

From Molecular Biology of the Cell, 4th edition

Motif combinatorics and predictive accuracy

The order of various TFs is significant

Combinatorial models are more accurate than single-TF models (unless a gene is under the control of only one TF).

Future directions

Because of the sensitivity of the model(s), even a very small amount of ambiguity can yield junk results.

For this reason, SAGE data is not particularly suitable, as only unique SAGE tags can be said to be unambiguous; this in turn excludes all sorts of potentially useful data.

However, we could use the microarray-based predictions to pick gene regulatory structures to investigate.
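The SAGE limitation mentioned above comes down to a filtering step: only tags that map to exactly one gene can be used, and everything else is discarded. A minimal sketch of that filter follows, with made-up tag sequences and gene names; the function name `unambiguous_tags` is hypothetical.

```python
from collections import defaultdict

def unambiguous_tags(tag_gene_pairs):
    """Keep only tags that map to exactly one gene."""
    genes_by_tag = defaultdict(set)
    for tag, gene in tag_gene_pairs:
        genes_by_tag[tag].add(gene)
    return {
        tag: next(iter(genes))
        for tag, genes in genes_by_tag.items()
        if len(genes) == 1
    }

# Hypothetical tag-to-gene assignments.
pairs = [
    ("AAACCCGGGT", "gene1"),
    ("TTTGGGCCCA", "gene2"),
    ("TTTGGGCCCA", "gene3"),  # same tag, second gene: ambiguous
    ("CCCAAATTTG", "gene4"),
]
usable = unambiguous_tags(pairs)
print(usable)
print(f"{len(usable)} of {len({t for t, _ in pairs})} tags usable")
```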