beyond co-expression: gene network inference

Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University http:/genetics.med.harvard.edu/~patrik


Page 1: Beyond Co-expression: Gene Network Inference

Beyond Co-expression: Gene Network Inference

Patrik D’haeseleer

Harvard University

http:/genetics.med.harvard.edu/~patrik

Page 2: Beyond Co-expression: Gene Network Inference

Beyond Co-expression

• Clustering approaches rely on co-expression of genes under different conditions

• Assumes co-expression is caused by co-regulation

• We would like to do better than that:
– Causal inference
– What is regulating what?

Page 3: Beyond Co-expression: Gene Network Inference

Gene Network Inference

Page 4: Beyond Co-expression: Gene Network Inference

Overview

• Modeling Issues:
– Level of biochemical detail
– Boolean or continuous?
– Deterministic or stochastic?
– Spatial or non-spatial?
• Data Requirements
• Linear Models
• Nonlinear models
• Conclusions

Page 5: Beyond Co-expression: Gene Network Inference

Level of Biochemical Detail

• Detailed models require lots of data!

• Highly detailed biochemical models are only feasible for very small systems which are extensively studied

• Example: Arkin et al. (1998), Genetics 149(4):1633-48, lysis-lysogeny switch in Lambda: 5 genes, 67 parameters based on 50 years of research; stochastic simulation required a supercomputer

Page 6: Beyond Co-expression: Gene Network Inference

Example: Lysis-Lysogeny

Arkin et al. (1998), Genetics 149(4):1633-48

Page 7: Beyond Co-expression: Gene Network Inference

Level of Biochemical Detail

• In-depth biochemical simulation of e.g. a whole cell is infeasible (so far)

• Less detailed network models are useful when data is scarce and/or network structure is unknown

• Once network structure has been determined, we can refine the model

Page 8: Beyond Co-expression: Gene Network Inference

Boolean or Continuous?

• Boolean Networks (Kauffman (1993), The Origins of Order) assume ON/OFF gene states.

• Allow analysis at the network level

• Provide useful insights into network dynamics

• Algorithms exist for network inference from binary data

[Figure: Boolean network example with genes A, B and C, where C = A AND B, and gene states 0 or 1]
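The ON/OFF abstraction can be sketched as a synchronous update of Boolean rules. This is only an illustration: the rule for C comes from the slide, while holding A and B constant as external inputs, and all function names, are assumptions made here.

```python
from itertools import product

# Each gene's next state is a Boolean function of the current network state.
# Only the rule for C is taken from the slide; A and B are held constant
# as external inputs (an assumption for this sketch).
RULES = {
    "A": lambda s: s["A"],
    "B": lambda s: s["B"],
    "C": lambda s: s["A"] and s["B"],
}


def boolean_step(state):
    """Apply every rule to the current state at once (synchronous update)."""
    return {gene: int(bool(rule(state))) for gene, rule in RULES.items()}


def truth_table():
    """Next value of C for every ON/OFF combination of A and B."""
    return {
        (a, b): boolean_step({"A": a, "B": b, "C": 0})["C"]
        for a, b in product((0, 1), repeat=2)
    }


if __name__ == "__main__":
    for (a, b), c in truth_table().items():
        print(f"A={a} B={b} -> C={c}")
```

Inference from binary data amounts to searching for rules whose truth tables are consistent with the observed state transitions.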

Page 9: Beyond Co-expression: Gene Network Inference

Boolean or Continuous?

• Boolean abstraction is a poor fit to real data

• Cannot model important concepts:
– amplification of a signal
– subtraction and addition of signals
– compensating for a smoothly varying environmental parameter (e.g. temperature, nutrients)
– varying dynamical behavior (e.g. cell cycle period)

• Feedback control: negative feedback is used to stabilize expression, but it causes oscillation in a Boolean model
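The feedback point can be illustrated with a single self-repressing gene. This is a toy sketch, not taken from the slides: the Boolean rule x(t+1) = NOT x(t) and the continuous rate law dx/dt = 1/(1+x) - x are both assumptions chosen for simplicity.

```python
def boolean_trajectory(x0, steps):
    """Self-repression in a synchronous Boolean model: x(t+1) = NOT x(t)."""
    traj = [x0]
    for _ in range(steps):
        traj.append(1 - traj[-1])
    return traj


def continuous_trajectory(x0, steps, dt=0.1):
    """Same feedback with a continuous level, dx/dt = 1/(1+x) - x,
    integrated with forward Euler (hypothetical rate law)."""
    traj = [x0]
    for _ in range(steps):
        x = traj[-1]
        traj.append(x + dt * (1.0 / (1.0 + x) - x))
    return traj


if __name__ == "__main__":
    print(boolean_trajectory(1, 6))             # flips between 1 and 0 forever
    print(continuous_trajectory(0.0, 200)[-1])  # settles to a steady level
```

The Boolean gene can only flip, so it never reaches the intermediate steady level that the continuous model settles into.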

Page 10: Beyond Co-expression: Gene Network Inference

Deterministic or Stochastic?

• Use of concentrations assumes individual molecules can be ignored

• Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis-lysogeny in lambda)

• Requires stochastic simulation (Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5)

• Significantly increases model complexity

Page 11: Beyond Co-expression: Gene Network Inference

Deterministic or Stochastic?

• Eukaryotes: larger cell volume, typically longer half-lives. Few known stochastic effects.

• Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell. Holstege et al. (1998), Cell 95:717-728.

• Human: 95% of the transcriptome is expressed at <5 copies/cell. Velculescu et al. (1997), Cell 88:243-251.

Page 12: Beyond Co-expression: Gene Network Inference

Spatial or Non-Spatial

• Spatiality introduces additional complexity:
– intercellular interactions
– spatial differentiation
– cell compartments
– cell types

• Spatial patterns also provide more data, e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152: 429-454.

• Few (no?) large-scale spatial gene expression data sets available so far.

Page 13: Beyond Co-expression: Gene Network Inference

Overview

• Modeling Issues:
– Level of biochemical detail
– Boolean or continuous?
– Deterministic or stochastic?
– Spatial or non-spatial?
• Data Requirements
• Linear Models
• Nonlinear models
• Conclusions

Page 14: Beyond Co-expression: Gene Network Inference

Overview

• Modeling Issues
• Data Requirements:
– Lower bounds from information theory
– Effect of limited connectivity
– Comparison with clustering
– Variety of data points needed
• Linear Models
• Nonlinear models
• Conclusions

Page 15: Beyond Co-expression: Gene Network Inference

Lower Bounds from Information Theory

• How many bits of information are needed just to specify the connection pattern of a network?

• $N^2$ possible connections between N nodes, so $N^2$ bits are needed to specify which connections are present or absent

• O(N) bits of information per “data point”, so O(N) data points are needed

Page 16: Beyond Co-expression: Gene Network Inference

Effect of Limited Connectivity

• Assume only K inputs per gene (on average), so NK connections out of $N^2$ possible:

$$\binom{N^2}{NK} \text{ possible connection patterns}$$

• Number of bits needed to fully specify the connection pattern:

$$\log_2 \binom{N^2}{NK} \approx NK \log_2 \frac{N}{K}$$

• At O(N) bits per data point, O(K log(N/K)) data points are needed
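As a rough numerical check of this bound, the exact bit count can be compared with the leading-order estimate. The genome size N = 6000 and connectivity K = 3 below are hypothetical values, and "about N bits per data point" is taken at face value with a constant of 1.

```python
from math import lgamma, log, log2


def log2_binomial(n, k):
    """log2 of the binomial coefficient C(n, k), via log-gamma to avoid huge integers."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)


def bits_for_pattern(n_genes, k_inputs):
    """Bits needed to say which N*K of the N^2 possible connections are present."""
    return log2_binomial(n_genes ** 2, n_genes * k_inputs)


def approx_data_points(n_genes, k_inputs):
    """Leading-order estimate K * log2(N / K), assuming N bits per data point."""
    return k_inputs * log2(n_genes / k_inputs)


if __name__ == "__main__":
    n, k = 6000, 3  # hypothetical genome size and average number of inputs
    print("exact bits / N :", bits_for_pattern(n, k) / n)
    print("K log2(N/K)    :", approx_data_points(n, k))
```

The exact count is somewhat larger than K log2(N/K) because the approximation drops lower-order terms, but both stay far below the N data points needed for a fully connected model.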

Page 17: Beyond Co-expression: Gene Network Inference

Comparison with clustering

• Use pairwise correlation comparisons as a stand-in for clustering

• As the number of genes increases, the number of false positives will increase as well, so we need to use a more stringent correlation test

• If we want to use the same correlation cutoff value r, we need to increase the number of data points as N increases: O(log(N)) data points needed
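The logarithmic growth can be illustrated numerically. This sketch is not the derivation used in the talk: it assumes a Bonferroni-style correction over all N(N-1)/2 gene pairs and the Fisher z approximation for the null distribution of a correlation coefficient; the cutoff r = 0.8 and the allowed false-positive budget are hypothetical.

```python
from math import atanh, ceil
from statistics import NormalDist


def data_points_needed(n_genes, r_cutoff=0.8, expected_false_positives=0.05):
    """Smallest number of data points M at which a correlation of r_cutoff
    still passes a per-pair threshold corrected for all gene pairs (one-sided)."""
    pairs = n_genes * (n_genes - 1) / 2
    p_per_pair = expected_false_positives / pairs
    z_needed = NormalDist().inv_cdf(1.0 - p_per_pair)
    # Under the null, atanh(r) * sqrt(M - 3) is approximately standard normal.
    return ceil(3 + (z_needed / atanh(r_cutoff)) ** 2)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, data_points_needed(n))
```

Each tenfold increase in N adds only a few data points, consistent with the O(log(N)) scaling.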

Page 18: Beyond Co-expression: Gene Network Inference

Summary

Model             Data points needed   Order of magnitude
Fully connected   N                    thousands
Connectivity K    K log(N/K)           hundreds?
Clustering        log(N)               tens

• Additional constraints reduce data requirements:
– limited connectivity
– choice of regulatory functions

• Network inference is feasible, but does require much more data than clustering

Page 19: Beyond Co-expression: Gene Network Inference

Variety of Data Points Needed

• To unravel regulation of a gene, need to sample many different combinations of its regulatory inputs (different environmental conditions and perturbations)

• Time series data yields dynamics, but all data points are related

• Steady-state data yields attractors, can sample state space more efficiently

• Both types of data will be needed, and multiple data sets of each

Page 20: Beyond Co-expression: Gene Network Inference

Overview

• Modeling Issues
• Data Requirements
• Linear Models:
– Formulation
– Underdetermined problem!
– Solution 1: reduce N
– Solution 2
• Nonlinear models
• Conclusions

Page 21: Beyond Co-expression: Gene Network Inference

Linear Models

• Basic model: weighted sum of inputs

$$y_i(t+\Delta t) = \sum_j w_{ji}\, y_j(t) + b_i \qquad \text{or} \qquad \frac{dy_i}{dt} = \sum_j w_{ji}\, y_j + b_i$$

• Simple network representation:

[Figure: network of genes g1 to g5 connected by weighted edges, e.g. w12, w23, w55]

• Only first-order approximation

• Parameters of the model: weight matrix containing N x N interaction weights

• “Fitting” the model: find the parameters $w_{ji}$, $b_i$ such that the model best fits the available data
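A minimal sketch of the discrete-time form of this model and of the quantity a fit would minimize. The indexing `w[j][i]` (influence of gene j on gene i) follows the slide's $w_{ji}$; the function names, the squared-error criterion and the example numbers are assumptions.

```python
def linear_step(y, w, b):
    """One update of the linear model: y_i(t + dt) = sum_j w[j][i] * y[j] + b[i]."""
    n = len(y)
    return [sum(w[j][i] * y[j] for j in range(n)) + b[i] for i in range(n)]


def sum_squared_error(series, w, b):
    """Total squared one-step prediction error over a time series of expression vectors."""
    error = 0.0
    for current, observed_next in zip(series, series[1:]):
        predicted = linear_step(current, w, b)
        error += sum((p - o) ** 2 for p, o in zip(predicted, observed_next))
    return error


if __name__ == "__main__":
    w = [[0.5, 0.0], [0.25, 1.0]]  # hypothetical 2-gene weight matrix
    b = [0.1, 0.0]                 # hypothetical bias terms
    y0 = [1.0, 2.0]
    y1 = linear_step(y0, w, b)
    print(y1, sum_squared_error([y0, y1], w, b))
```

Fitting then means choosing `w` and `b` to make `sum_squared_error` as small as possible on the measured series.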

Page 22: Beyond Co-expression: Gene Network Inference

Underdetermined problem!

• Assumes fully connected network: need at least as many data points (arrays, conditions) as variables (genes)!

• Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit

• No single solution, rather infinite number of parameter settings that will all fit the data equally well
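A toy case makes the ambiguity concrete: with two genes and a single observed transition, very different weight matrices reproduce the data exactly. All numbers here are invented for illustration.

```python
def predict(y, w, b):
    """Linear model update: y_i(next) = sum_j w[j][i] * y[j] + b[i]."""
    n = len(y)
    return [sum(w[j][i] * y[j] for j in range(n)) + b[i] for i in range(n)]


# One observed transition (hypothetical data): 4 weights, only 2 measured values.
observed_input = [1.0, 1.0]
observed_output = [2.0, 0.0]
bias = [0.0, 0.0]

w_a = [[2.0, 0.0], [0.0, 0.0]]  # explanation A: gene 1 activates itself
w_b = [[0.0, 0.0], [2.0, 0.0]]  # explanation B: gene 2 activates gene 1

if __name__ == "__main__":
    print(predict(observed_input, w_a, bias), predict(observed_input, w_b, bias))
    # The two explanations only disagree on a condition that was never measured.
    print(predict([1.0, 0.0], w_a, bias), predict([1.0, 0.0], w_b, bias))
```

Both matrices fit the observation, yet they imply opposite regulators; only a new condition in which gene 2 is off would tell them apart.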

Page 23: Beyond Co-expression: Gene Network Inference

Solution 1: reduce N

• Rather than trying to model all genes, we can reduce the dimensionality of the problem:

• Network of clusters: construct a linear model based on the cluster centroids
– rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55, 1-3:129-136.
– yeast cell cycle (15-18 clusters): Mjolsness et al. (2000), Advances in Neural Information Processing Systems 12; van Someren et al. (2000), ISMB2000, 355-366.

• Network of Principal Components: linear model between “characteristic modes” of the data. Holter et al. (2001), PNAS 98(4):1693-1698.

Page 24: Beyond Co-expression: Gene Network Inference

Solution 2:

• Take advantage of additional information:
– replicates
– accuracy of measurements
– smoothness of time series
– …

• Most likely, the network will still be poorly constrained. Need a method to identify and extract those parts of the model that are well-determined and robust

Page 25: Beyond Co-expression: Gene Network Inference

What’s next?

• Regulatory motifs: once we have identified the corresponding DNA binding proteins (transcription factors), we can start building the gene network from there

• Integration with other data:
– transcription factors
– functional annotation
– known interactions in the literature
– protein-protein interactions
– protein expression levels
– genetic data
– ...

Page 26: Beyond Co-expression: Gene Network Inference

Linking Regulatory Motifs to Expression Data

Patrik D’haeseleer

Harvard University

http:/genetics.med.harvard.edu/~patrik

Page 27: Beyond Co-expression: Gene Network Inference

Introduction

• Gene expression is regulated by Transcription Factors (TFs), that bind to specific regulatory motifs in the promoter region of the gene.

Synonyms: regulatory element, regulatory sequence, promoter elements, promoter motifs, (TF) binding site, operator (in prokaryotes), …

• Question: Do genes with similar expression patterns share regulatory motifs?

[Figure: a TF bound to a regulatory motif on the DNA, next to a gene]

Page 28: Beyond Co-expression: Gene Network Inference

1: Systematic Determination of Genetic Network Architecture

[Figure: genes plotted along axes Time-point 1, Time-point 2 and Time-point 3, with three panels showing Normalized Expression versus Time-point (1 to 3)]

Tavazoie et al., Nature Genetics 22, 281 – 285 (1999)

Page 29: Beyond Co-expression: Gene Network Inference

Search for Motifs in Promoter Regions

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

Page 30: Beyond Co-expression: Gene Network Inference

AAAAGAGTCA

AAATGACTCA

AAGTGAGTCA

AAAAGAGTCA

GGATGAGTCA

AAATGAGTCA

GAATGAGTCA

AAAAGAGTCA

**********

[Figure: the seven upstream sequences from the previous slide (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3), with the aligned motif occurrences marked]

Best Motif Found by AlignACE
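The aligned sites listed on this slide can be summarized as per-position base counts and a consensus sequence. This sketch only tallies the alignment shown; it does not reproduce the AlignACE search itself, and the function names are made up for illustration.

```python
from collections import Counter

# The eight aligned sites shown on the slide.
SITES = [
    "AAAAGAGTCA",
    "AAATGACTCA",
    "AAGTGAGTCA",
    "AAAAGAGTCA",
    "GGATGAGTCA",
    "AAATGAGTCA",
    "GAATGAGTCA",
    "AAAAGAGTCA",
]


def position_counts(sites):
    """Count how often each base occurs at each aligned position."""
    return [Counter(column) for column in zip(*sites)]


def consensus(sites):
    """Most frequent base at each position of the alignment."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


if __name__ == "__main__":
    print(consensus(SITES))  # AAATGAGTCA
    for i, counts in enumerate(position_counts(SITES), start=1):
        print(i, dict(counts))
```

Positions where all eight sites agree (the GA and TCA columns) carry the most information; the variable fourth position is split between A and T.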

Page 31: Beyond Co-expression: Gene Network Inference

[Figure: Replication & DNA synthesis (2). Panels: expression profile (s.d. from mean); MCB motif, Number of ORFs per CLUSTER (clusters 1 to 30); Number of sites versus Distance from ATG (b.p.)]

MIPS Functional category (# ORFs)       ORFs within category
DNA synthesis and replication (82)      23
Cell cycle control and mitosis (312)    30
Recombination and DNA repair (84)       11
Nuclear organization (720)              40

N=182

Page 32: Beyond Co-expression: Gene Network Inference

Systematic Determination of Genetic Network Architecture

• Tavazoie et al., Nature Genetics 22, 281–285 (1999)
• Most motifs found are highly selective for the cluster they were found in.
• Can find many known binding sites for transcription factors.
• Also finds many novel regulatory motifs, associated with specific functional categories.

1) cluster
2) identify regulatory motifs in clustered genes
3) identify TFs that bind to those motifs

Result: gene regulation network

Page 33: Beyond Co-expression: Gene Network Inference

2: Regulatory Element Detection Using Correlation with Expression

• What is the contribution of each regulatory motif (or the TF that binds to that motif) to the expression level of the genes containing the motif?

• Given a set of known or putative regulatory motifs, identify all genes that contain the motif in their promoter region.

• For a single expression experiment (e.g. single point in a time series), is the presence of the motif correlated with the expression level of the genes?

• Perform multiple regression of (log) expression level on the presence/absence of the motifs.

• Plot contribution of motif throughout time series.

Bussemaker et al., Nature Genetics 27, 167 – 174 (2001)

Page 34: Beyond Co-expression: Gene Network Inference

Contribution of Motifs to Expression Levels

• The presence of motif 1 is correlated with the expression levels of the genes in which it appears

• Motif 2 is not correlated with the expression levels of the genes in which it appears

• Motif 3 is negatively correlated with the expression levels of the genes in which it appears

Page 35: Beyond Co-expression: Gene Network Inference

Linear Combination of Motif Contributions

$$A_g = C + F_1 N_{1g} + F_2 N_{2g} + F_3 N_{3g} + \dots$$

• Find the most highly correlated motif.
• Determine its contribution $F_i$ to expression level by linear regression.
• Subtract its contribution from the expression levels.
• Find the next highest correlated motif.
• Repeat until there are no more significantly correlated motifs.

• Repeat this entire analysis for each time point of a time series: the weights $F_i$ for the individual motifs will change throughout the time course.
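The greedy procedure on this slide can be sketched for a single expression experiment as follows. This assumes $A_g$ is the (log) expression level of gene g and $N_{ig}$ the count of motif i in its promoter. The fixed |r| cutoff stands in for a proper significance test, and the function names and toy data are hypothetical, not taken from Bussemaker et al.

```python
def mean(xs):
    return sum(xs) / len(xs)


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is (nearly) constant."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx < 1e-12 or syy < 1e-12:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sxx * syy) ** 0.5


def fit_slope(x, y):
    """Least-squares slope of y on x (single-variable linear regression)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)


def greedy_motif_fit(expression, motif_counts, min_abs_r=0.3):
    """Pick the motif most correlated with the residual, regress, subtract, repeat."""
    residual = [a - mean(expression) for a in expression]
    remaining = dict(motif_counts)
    contributions = {}
    while remaining:
        name, r = max(((m, pearson(n, residual)) for m, n in remaining.items()),
                      key=lambda item: abs(item[1]))
        if abs(r) < min_abs_r:  # stand-in for "no more significant motifs"
            break
        counts = remaining.pop(name)
        f = fit_slope(counts, residual)
        residual = [res - f * (c - mean(counts)) for res, c in zip(residual, counts)]
        contributions[name] = f
    return contributions


if __name__ == "__main__":
    # Toy data: 4 genes, 3 hypothetical motifs; expression = 0.5 + 2*m1 - 1*m2.
    motifs = {"motif_1": [1, 1, 0, 0], "motif_2": [1, 0, 1, 0], "motif_3": [1, 0, 0, 1]}
    expression = [1.5, 2.5, -0.5, 0.5]
    print(greedy_motif_fit(expression, motifs))
```

Running the same fit separately at each time point of a series would give one value of each contribution per time point.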

Page 36: Beyond Co-expression: Gene Network Inference

Time Courses of Regulatory Signals

$$A_g(t) = C + F_1(t) N_{1g} + F_2(t) N_{2g} + F_3(t) N_{3g} + \dots$$

• We can think of the time-varying contributions $F_i$ of each motif as the Regulatory Signals of the transcription factors that bind to these motifs

[Figure: two panels of regulatory signals plotted against Time (minutes)]

Page 37: Beyond Co-expression: Gene Network Inference

Regulatory Element Detection Using Correlation with Expression

• Bussemaker et al., Nature Genetics 27, 167–174 (2001)
• Can be used with known regulatory motifs, sets of putative motifs, and even exhaustively on the set of all motifs up to a certain length (n=7).
• Known motifs generally have high statistical significance.
• Allows us to infer regulatory inputs of (possibly unknown) transcription factors.
• Accounts for only 30% of total signal present in genome-wide expression patterns.
• Purely linear model: no synergistic effects between TFs, cooperative binding, etc.

Page 38: Beyond Co-expression: Gene Network Inference

3: Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements

• Most transcription factors are thought to work in concert with other TFs: synergistic effects

• Clustering:
– a motif may occur in more than one cluster, because it may give rise to different expression patterns depending on its interaction partners.
– several motifs may occur in the same cluster.

• Correlation with expression pattern:
– by itself, a motif may not show a clear expression pattern.
– contributions of multiple motifs may not be simply additive.

Pilpel et al., Nature Genetics 29, 153–159 (2001)

Page 39: Beyond Co-expression: Gene Network Inference

Synergy between Mcm1 and SFF in Cell Cycle Data Set

[Figure: Expression level versus Time for three gene sets: SFF but not Mcm1 (EC=0.05), Mcm1 but not SFF (EC=0.05), SFF and Mcm1 (EC=0.23)]

• Mcm1 and SFF were not detected in Tavazoie et al.

• Yet TFs that bind these motifs are known to interact in control of G2 genes (Nature. 2000 406:90-4.)

• Bussemaker et al. found that these motifs are antagonistic.

Page 40: Beyond Co-expression: Gene Network Inference

Expression Coherence and Synergy

• Expression Coherence (EC) score indicates how tightly clustered the expression profiles of a set of genes are.

• For every combination of N = 2, 3 motifs:
1) Calculate the expression coherence score of the genes that have the N motifs
2) Calculate the expression coherence score of genes that have every possible subset of N-1 motifs
3) Test (statistically) the hypothesis that the score of the ORFs with N motifs is significantly higher than that of ORFs that have any subset of N-1 motifs
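Steps 1 and 2 can be sketched as below. The slide does not define the EC score precisely; this sketch assumes it is the fraction of gene pairs whose normalized expression profiles lie within a distance threshold of each other. The threshold, the helper names and the toy profiles are hypothetical, and the statistical test of step 3 is left out.

```python
from itertools import combinations
from math import sqrt


def normalize(profile):
    """Scale a profile to mean 0 and standard deviation 1."""
    m = sum(profile) / len(profile)
    sd = sqrt(sum((v - m) ** 2 for v in profile) / len(profile))
    return [(v - m) / sd for v in profile] if sd > 0 else [0.0] * len(profile)


def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose normalized profiles are closer than threshold."""
    pairs = list(combinations([normalize(p) for p in profiles], 2))
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs
                if sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) < threshold)
    return close / len(pairs)


def genes_with_all(motifs, gene_motifs):
    """Genes whose promoters contain every motif in the given combination."""
    wanted = set(motifs)
    return sorted(g for g, present in gene_motifs.items() if wanted <= present)


if __name__ == "__main__":
    profiles = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]  # toy expression profiles
    print(expression_coherence(profiles, threshold=0.5))
```

For a motif pair, the score of `genes_with_all` for both motifs would be compared against the scores of the genes carrying each single motif.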

Page 41: Beyond Co-expression: Gene Network Inference

The “Combinogram”

[Figure: combinations of the motifs MCB, MSE, URS1, SCB, MCM1' and SFF', shown with Expression Coherence and Correlation scales and G1 and G2 groups (cell cycle data)]

• Highly synergistic interaction between MCB and SFF
• Previously unknown
• Subsequently predicted via chromatin immuno-precipitation (ChIP)

Ho et al. Nature. 2002

Page 42: Beyond Co-expression: Gene Network Inference

Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements

• Pilpel et al., Nature Genetics 29, 153–159 (2001)
• Found several known and novel interactions between regulatory elements active in cell cycle, sporulation and stress response.
• Doesn’t assume a specific (e.g. linear) model of TF interactions.
• Combined with TF expression patterns, may allow us to infer a model of interaction.

Page 43: Beyond Co-expression: Gene Network Inference

Protein Networks

Patrik D’haeseleer

Harvard University

http:/genetics.med.harvard.edu/~patrik

Page 44: Beyond Co-expression: Gene Network Inference

Yeast 2-Hybrid Assays

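[Figure: yeast two-hybrid schematic. A transcription factor (e.g. Gal4) has a DNA-binding domain (BD) and an activation domain (AD) and drives a reporter gene from its binding site. In the assay, a “bait” fusion and a “prey” fusion (Prot1 and Prot2, each fused to one of the two domains) are brought together by mating MATa and MATα strains; if the two proteins interact, an active TF is reconstituted at the binding site and the reporter gene is expressed.]

Fields and Song, Nature 340:245-246 (1989)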

Page 45: Beyond Co-expression: Gene Network Inference

Large-Scale 2-Hybrid Data Sets

• Uetz et al., Nature 403:623-627 (2000)
– 6000 x 192 protein pairs screened using protein array
– nearly all 6000 x 6000 pairs, using pooled prey libraries
– total of 957 putative interactions between 1004 proteins

• Ito et al., PNAS 98:4569-4574 (2001)
– nearly all 6000 x 6000 pairs, using bait and prey pools
– total of 4549 putative interactions between 3278 proteins
– core set of 841 interactions between 797 proteins

• Surprisingly little overlap between the data sets, possibly indicating a large number of missed interactions (false negatives).
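Measuring the overlap between two interaction lists comes down to comparing sets of unordered protein pairs. The sketch below uses placeholder protein names (P1, P2, ...) rather than real data.

```python
def as_pair_set(interactions):
    """Treat each interaction as an unordered pair, so (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in interactions}


def overlap_summary(set_a, set_b):
    """Sizes of the two data sets, their intersection, and the Jaccard overlap."""
    a, b = as_pair_set(set_a), as_pair_set(set_b)
    shared = a & b
    return {
        "a": len(a),
        "b": len(b),
        "shared": len(shared),
        "jaccard": len(shared) / len(a | b) if a | b else 0.0,
    }


if __name__ == "__main__":
    # Placeholder interaction lists; bait/prey order differs between screens.
    screen_1 = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
    screen_2 = [("P2", "P1"), ("P5", "P6"), ("P6", "P7"), ("P3", "P8")]
    print(overlap_summary(screen_1, screen_2))
```

Normalizing pair order first matters because the same interaction may be reported as bait-prey in one screen and prey-bait in the other.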

Page 46: Beyond Co-expression: Gene Network Inference

Intersections between Protein Interaction Data Sets

[Figure: two Venn diagrams. First: MIPS (1546), Ito full (4475) and Uetz (947). Second: MIPS (1546), Ito core (806) and Uetz (947).]

Page 47: Beyond Co-expression: Gene Network Inference

Causes of False Positives

• Bait acts as activator
• Bait interacts with endogenous activator
• Prey binds to DNA
• Prey interacts with endogenous transcription factors
• Bait interacts with Activation Domain
• Prey interacts with DNA Binding Domain
• “Sticky” proteins (nonspecific binding)
• Changes in plasmid copy number
• Various other artifacts
• . . .

[Figure: two-hybrid schematic with Prot1 and Prot2 fusions, binding site and reporter gene]

Page 48: Beyond Co-expression: Gene Network Inference

Yeast Protein-Protein Interaction Map

Uetz, Schwikowski, Fields and co-workers; Ito and co-workers

Each node is a protein

Each line is an interaction

5560 putative interactions

3725 different proteins

~ 3 interactions / protein

Page 49: Beyond Co-expression: Gene Network Inference

[Figure: protein interaction map with Membrane Proteins and Transcription Factors labeled]

Page 50: Beyond Co-expression: Gene Network Inference

- membrane protein

- DNA-binding protein

- all other yeast proteins

- physical interaction between two proteins

Page 51: Beyond Co-expression: Gene Network Inference

Problem: How to Rank Possible Pathways?

• Possible paths from Ste3: 1045 different paths to 143 transcription factors

[Figure: interaction map with Ste2/3 and Ste12 labeled]
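Enumerating candidate pathways amounts to listing simple paths in the interaction graph from a membrane protein to any transcription factor. The sketch below uses a depth-first search with a length limit; the toy graph, node names and the limit are placeholders, not the yeast data.

```python
def build_graph(edges):
    """Undirected adjacency lists from a list of pairwise interactions."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    return graph


def find_paths(graph, source, targets, max_len):
    """All simple paths from source to any target, using at most max_len edges."""
    paths = []

    def extend(path):
        node = path[-1]
        if node in targets and len(path) > 1:
            paths.append(path)
        if len(path) - 1 >= max_len:
            return
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in path:  # keep paths simple (no repeated proteins)
                extend(path + [neighbor])

    extend([source])
    return paths


if __name__ == "__main__":
    # Placeholder network: M is a membrane protein, TF1 and TF2 transcription factors.
    g = build_graph([("M", "A"), ("A", "TF1"), ("A", "B"), ("B", "TF2"), ("M", "B")])
    for p in find_paths(g, "M", {"TF1", "TF2"}, max_len=3):
        print(" - ".join(p))
```

The number of paths grows quickly with the length limit, which is why a ranking criterion is needed.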

Page 52: Beyond Co-expression: Gene Network Inference

Rank Predicted Paths by Degree of Expression Correlation from Microarray Experiments

Path                                                Average pairwise correlation coefficient among pathway members*
STE3 → AKR1 → STE5 → STE4 → FAR1 → CDC24 → SOH1     0.190
STE3 → AKR1 → IQG1 → CDC42 → BEM4 → RHO1 → SKN7     0.059
STE3 → AKR1 → STE4 → FAR1 → FUS3 → DIG2 → STE12     0.281
STE3 → AKR1 → GCS1 → YGL198W → SAS10 → NET1         -0.106

* Microarray data downloaded from Rosetta Inpharmatics

• Known pathways often show correlated expression

• Known interacting proteins often show correlated expression
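The ranking criterion can be sketched as the mean Pearson correlation over all pairs of genes on a path. The expression profiles and gene labels below are invented placeholders, and every path member is assumed to have a profile.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either profile is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sxx * syy) ** 0.5


def path_score(path, expression):
    """Average pairwise correlation among all members of a path."""
    pairs = list(combinations(path, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


def rank_paths(paths, expression):
    """Paths sorted from most to least co-expressed, as (score, path) tuples."""
    scored = [(path_score(p, expression), p) for p in paths]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Placeholder profiles across four hypothetical conditions.
    expression = {
        "A": [1.0, 2.0, 3.0, 4.0],
        "B": [2.0, 4.0, 6.0, 8.5],
        "C": [4.0, 3.0, 2.0, 1.0],
    }
    for score, path in rank_paths([["A", "C"], ["A", "B"], ["A", "B", "C"]], expression):
        print(f"{score:+.3f}", " - ".join(path))
```

Averaging over all pairs, rather than only adjacent pairs, rewards paths whose members are co-expressed as a group.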

Page 53: Beyond Co-expression: Gene Network Inference

Classical View of MAPK Pathways

adapted from C.Roberts, et al., Science, 287, 873 (2000)

Page 54: Beyond Co-expression: Gene Network Inference

The Protein Network View

• Highly interconnected, not just a linear pathway!

• Some proteins are missing from the protein interaction data sets (Cdc42, Ste20).

• Includes several additional proteins (especially Akr1, Kss1).

Page 55: Beyond Co-expression: Gene Network Inference

Conclusions

• Protein interaction data and expression data are both noisy. Combining them increases the accuracy.

• Can estimate protein interaction error rates by looking at consistency between data sets, leading to a probabilistic interaction model (work in progress).

• Pathways are far more interconnected than often portrayed.

• Can integrate various other forms of data:
– co-localization of proteins
– homology with known interacting proteins
– “Rosetta Stone” method

Page 56: Beyond Co-expression: Gene Network Inference

Acknowledgments

Roland Somogyi, Stefanie Fuhrman, Xiling Wen

UNM: Stephanie Forrest, Andreas Wagner, David Peabody, Barak Pearlmutter

NCGR: Jason Stewart, Pedro Mendes

Harvard: Tzachi Pilpel, Martin Steffen, Allegra Petti, John Aach, George Church

The Santa Fe Institute