computational methods for studying gene regulatory networks · 2008-05-22 · •overview...

Computational methods for studying gene regulatory networks Cornell Weill May 14 Dr. Christina Leslie


Computational methods for studying gene regulatory networks

Cornell Weill, May 14

Dr. Christina Leslie

Dual goals

• Overview of computational approaches to deciphering “gene regulatory networks”

• Introduce basic statistical/machine learning concepts:
– Supervised vs. unsupervised learning
– Probabilistic models: maximum likelihood estimate (MLE), Bayesian methods
– Training vs. test data, cross-validation
– Generalization vs. overfitting: how to measure statistical performance

Levels of gene regulation

• Binding of regulatory proteins to gene’s control region
• Chemical and structural modification of DNA and chromatin

Alberts et al., MBC, Figure 7-5

Genome-wide expression data

• mRNA expression (microarrays)
– cDNA vs. oligonucleotide arrays

Finer resolution array data

• Tiling arrays
• Exon-scanning, exon-exon junction arrays
• Also: arrayCGH (genomic copy number variation), microRNA arrays

Shoemaker et al. 2001

Pyrosequencing

• Solexa, 454, MPSS: expression data based on massively parallel sequencing

• Detect low transcript levels, ncRNAs

ChIP-chip

• Genome occupancy maps for
– Transcription factors
– Pol2
– Chromatin regulators
– Histone modifications


ENCODE pilot project

Transcriptional regulation

• “Very complex” in eukaryotes, in particular in metazoans

• Most computational work on the yeast S. cerevisiae

Noise in expression data

• Estimate p-value from replicate noise
• (On board: multiple comparison issues)

Clustering

• Find co-expressed genes (“clusters”)
– Motif discovery algorithms for cis regulatory elements (more later)
– Enrichment of promoter sequences for known DNA motifs
– Functional annotations

• Problems
– Co-expression ≠ co-regulation
– Static cluster assumption
– Statistical robustness of clusters?

ML background

• Supervised learning
– Labeled training data: (x_i, y_i), i = 1…m, e.g. expression profile x_i, tumor label y_i
– Learn a prediction function f: X→Y, can be used to make new predictions on test data

• Unsupervised learning
– Unlabeled data: x_i, i = 1…m
– E.g. clustering, dimensionality reduction

• Density estimation
– Estimate probabilistic model P(x|Θ)
– Used in supervised (e.g. P(y|x,Θ)) or unsupervised settings

Example: k-means clustering

• Next slides: k-means algorithm
• Example of unsupervised learning
• Analysis:
– Objective function, convergence
– Dependence on initialization
– Choice of k: overfitting, use of stability measure

Example: k-means clustering

• Given expression profiles $x_g$, $g = 1 \ldots M$, assign to K clusters with centroids $\mu_j$, $j = 1 \ldots K$

• Notation: indicator variables

$$Z_g^j = \begin{cases} 1, & \text{if gene } g \text{ is currently assigned to cluster } j \\ 0, & \text{otherwise} \end{cases}$$

K-means

• Initialize: $\mu_j \leftarrow$ random example
• Repeat until assignments do not change:

$$Z_g^j = \begin{cases} 1, & \text{if } j = \arg\min_{j'} \|x_g - \mu_{j'}\|^2 \\ 0, & \text{otherwise} \end{cases}$$

$$\mu_j \leftarrow \frac{\sum_g Z_g^j \, x_g}{\sum_g Z_g^j}, \qquad j = 1 \ldots K$$
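The two update rules above can be sketched in plain Python. This is a minimal illustration, not the lecture's own code: the function names, the `seed` and `max_iter` parameters, and the choice to leave a centroid unchanged when its cluster empties are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two profiles."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(profiles, k, seed=0, max_iter=100):
    """Alternate the assignment step and the centroid step until stable."""
    rng = random.Random(seed)
    # Initialize: each centroid is a randomly chosen example.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each gene goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in profiles
        ]
        if new_assignments == assignments:
            break  # assignments did not change
        assignments = new_assignments
        # Centroid step: mean of the profiles assigned to each cluster.
        for j in range(k):
            members = [x for x, a in zip(profiles, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

def distortion(profiles, assignments, centroids):
    """Sum of squared distances from each profile to its assigned centroid."""
    return sum(squared_distance(x, centroids[a])
               for x, a in zip(profiles, assignments))
```

Calling `kmeans(profiles, 2)` on a list of equal-length numeric lists returns one cluster index per profile plus the final centroids; rerunning with a different `seed` is a simple way to see the dependence on initialization.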

Some analysis

• Objective function: distortion measure

$$J = \sum_{g=1}^{M} \sum_{j=1}^{K} Z_g^j \, \|x_g - \mu_j\|^2$$

• Each iteration decreases J, converges to local min in finite # steps, depends on initialization

• Choice of K?
– Larger K reduces distortion, even on held-out data
– Stability: subsample twice, measure similarity on overlap in terms of pairs of examples in same cluster
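One way to make the stability idea concrete is to compare two clusterings on the examples they share, counting pairs that land in the same cluster. The slide does not say which similarity to use; the Jaccard index over co-clustered pairs below is an assumption, as are the function names and the dict-based input format.

```python
from itertools import combinations

def co_clustered_pairs(labels, ids):
    """Set of pairs (u, v) among ids that share a cluster label."""
    return {(u, v) for u, v in combinations(ids, 2) if labels[u] == labels[v]}

def pair_agreement(labels_a, labels_b):
    """Jaccard similarity of co-clustered pairs on the shared examples.

    labels_a, labels_b: dicts mapping example id -> cluster label, e.g. the
    results of clustering two different subsamples of the same gene set.
    """
    shared = sorted(set(labels_a) & set(labels_b))
    pairs_a = co_clustered_pairs(labels_a, shared)
    pairs_b = co_clustered_pairs(labels_b, shared)
    union = pairs_a | pairs_b
    # Assumption: two clusterings with no co-clustered pairs count as agreeing.
    return len(pairs_a & pairs_b) / len(union) if union else 1.0
```

Because only co-membership is compared, the score ignores how the clusters happen to be numbered in each run. Averaging it over many subsample pairs for each candidate K gives a stability curve to inspect.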

DNA motifs

• E.g. TF binding sites
• k-mers vs. PSSM = position-specific scoring matrix
• E.g. data for yeast Rox1 motif
• Use data D to estimate generative model, P(Msite|D,θ)

Rox1 sites (one 9-mer per line):

TGTATTGTT
TCTATTGTT
TCAATTGTT
TGCTTTGTT
CCCATTGTT
CCGATTGTT
CGCATTGTT
CCTATTGTG
GCTATTGTT
TTTATTGTT
GGCATTGTT
CCTATTGTT
TCCATTGTT
CTCATTGTT
TTCATTGTT
CCTATTGTT
CGTATTGTC

Consensus: YSYATTGTT

PSSMs

• Motif model $M_{\text{site}}$

$$P(b_1 b_2 \cdots b_k \mid M_{\text{site}}, \theta) = p^1_{b_1} \, p^2_{b_2} \cdots p^k_{b_k}, \qquad \theta = \{p^i_x\}$$

• Background model $M_{\text{null}}$

$$P(b_1 b_2 \cdots b_k \mid M_{\text{null}}, \theta_0) = q_{b_1} \, q_{b_2} \cdots q_{b_k}, \qquad \theta_0 = \{q_x\}$$

• Log odds score

$$l(b) = \log \frac{P(b \mid M_{\text{site}})}{P(b \mid M_{\text{null}})} = \sum_{i=1..k} \log \left( \frac{p^i_{b_i}}{q_{b_i}} \right)$$
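As a rough sketch of how the log odds score is used, the code below scores a single k-mer and then scans a longer sequence for its best-scoring window. The representation (a list of per-position dicts for the PSSM, one dict for the background) and the function names are assumptions for illustration; every probability is assumed nonzero.

```python
import math

def log_odds(site, pssm, background):
    """Sum over positions of log(p_i(b_i) / q(b_i)) for one k-mer."""
    return sum(math.log(pssm[i][b] / background[b]) for i, b in enumerate(site))

def best_window(sequence, pssm, background):
    """Return (score, start) of the highest-scoring k-length window."""
    k = len(pssm)
    scores = [
        (log_odds(sequence[j:j + k], pssm, background), j)
        for j in range(len(sequence) - k + 1)
    ]
    return max(scores)
```

A positive score means the window is more probable under the motif model than under the background model. A zero entry in the PSSM would make the logarithm undefined.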

MLE

• Estimate parameters from data D: maximize likelihood function

$$\max_\theta \log P(D \mid M_{\text{site}}, \theta) = \max_\theta \sum_{b \in D} \log P(b \mid M_{\text{site}}, \theta)$$

$$\Rightarrow \quad p^i_x = \frac{C^i_x}{\sum_{x'} C^i_{x'}}$$

where $C^i_x$ is the number of sites in D with base x at position i.

Maximum likelihood estimate = sample frequency
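To see the sample-frequency estimate on real numbers, the sketch below applies it to the Rox1 data from the DNA motifs slide. It assumes that the run of letters on that slide splits into seventeen 9-mers, as listed here; the function name and the list-of-dicts layout are also just choices made for the example.

```python
from collections import Counter

BASES = "ACGT"

# Assumption: the slide's Rox1 data, split into 9-mers.
ROX1_SITES = [
    "TGTATTGTT", "TCTATTGTT", "TCAATTGTT", "TGCTTTGTT", "CCCATTGTT",
    "CCGATTGTT", "CGCATTGTT", "CCTATTGTG", "GCTATTGTT", "TTTATTGTT",
    "GGCATTGTT", "CCTATTGTT", "TCCATTGTT", "CTCATTGTT", "TTCATTGTT",
    "CCTATTGTT", "CGTATTGTC",
]

def mle_pssm(sites):
    """Per-position base frequencies: count / number of sites."""
    pssm = []
    for column in zip(*sites):  # one column per motif position
        counts = Counter(column)
        pssm.append({b: counts[b] / len(sites) for b in BASES})
    return pssm
```

On this data the fifth position is T in every site, so the estimate assigns probability 0 to A, C and G there. Any new site with a different base at that position gets likelihood zero, which is the usual motivation for smoothing.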


Results on test data

Donor splice site (5’) motif

Enrichment in a cluster

• Given a set/cluster of genes, are promoter sequences enriched for a (known) TF binding site?
– Z-score: treat each (overlapping) k-mer window as binomial trial, with $p_0$ = “background frequency”

$$Z = \frac{C - N p_0}{\sqrt{N p_0 (1 - p_0)}}$$

Score for C occurrences of motif in N windows
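A small sketch of this Z-score, assuming the motif is a single exact k-mer and that the background frequency `p0` is supplied by the caller; both function names are hypothetical. It uses the standard binomial form, (C - N·p0) divided by the square root of N·p0·(1 - p0).

```python
import math

def count_windows(promoters, kmer):
    """Count exact k-mer hits (C) and total overlapping windows (N)."""
    k = len(kmer)
    n_hits = 0
    n_windows = 0
    for seq in promoters:
        for j in range(len(seq) - k + 1):
            n_windows += 1
            if seq[j:j + k] == kmer:
                n_hits += 1
    return n_hits, n_windows

def enrichment_zscore(n_hits, n_windows, p0):
    """Binomial Z-score: (C - N*p0) / sqrt(N*p0*(1 - p0))."""
    expected = n_windows * p0
    return (n_hits - expected) / math.sqrt(n_windows * p0 * (1.0 - p0))
```

Overlapping windows are not truly independent trials, so the score is better read as a ranking heuristic than as an exact significance level.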

Enrichment in a cluster

• Other enrichment statistics
– Hypergeometric p-value: probability of obtaining at least k promoters with motif

$$\sum_{i=k..n} \frac{\binom{m}{i} \binom{N-m}{n-i}}{\binom{N}{n}}$$

N = total #genes
m = total #genes with motif
n = size of cluster
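The hypergeometric tail sum can be evaluated directly with exact integer binomials. This is a minimal sketch using `math.comb` (Python 3.8+); the function and argument names are illustrative.

```python
from math import comb

def hypergeom_pvalue(k, n_genes, n_with_motif, cluster_size):
    """P(at least k of the cluster's promoters carry the motif).

    n_genes = N, n_with_motif = m, cluster_size = n in the slide's notation.
    """
    total = comb(n_genes, cluster_size)
    upper = min(n_with_motif, cluster_size)
    tail = sum(
        comb(n_with_motif, i) * comb(n_genes - n_with_motif, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return tail / total
```

For example, with N = 10 genes, m = 4 carrying the motif and a cluster of n = 3, the probability of at least one motif-bearing promoter is 1 - C(6,3)/C(10,3) = 5/6.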

Bayesian statistics

• Use of pseudocounts for smoothing
• Bayesian estimates:
– Bayes rule
– Posterior mean estimate
– Dirichlet prior

Pseudocounts and Priors

• Smooth MLE with pseudocounts $\alpha^i_x$

$$p^i_x = \frac{C^i_x + \alpha^i_x}{\sum_{x'} \left( C^i_{x'} + \alpha^i_{x'} \right)}$$

• Posterior probability

$$P(\theta \mid D, M) = \frac{P(D \mid \theta, M) \, P(\theta \mid M)}{P(D \mid M)}$$

Numerator: likelihood × prior. Denominator: evidence.
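A short sketch of pseudocount smoothing, assuming the same pseudocount `alpha` for every base and position (the slide allows a separate value per base and position). The function name and data layout are made up for the example.

```python
from collections import Counter

BASES = "ACGT"

def smoothed_pssm(sites, alpha=1.0):
    """Per-position estimate (C + alpha) / (n_sites + 4 * alpha)."""
    pssm = []
    for column in zip(*sites):  # one column per motif position
        counts = Counter(column)
        total = len(sites) + alpha * len(BASES)
        pssm.append({b: (counts[b] + alpha) / total for b in BASES})
    return pssm
```

With `alpha=0.0` this reduces to the plain sample frequencies. Any positive `alpha` keeps every base probability above zero, so unseen bases no longer force a site's likelihood to zero.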

Pseudocounts and priors

• Posterior mean estimate:

$$\theta^{\text{PME}} = \int \theta \, P(\theta \mid D) \, d\theta$$

• Can show for Dirichlet prior: $\theta^{\text{PME}}$ is equivalent to smoothing with pseudocounts $\{\alpha^i_x\}$

$$D(p \mid \alpha) = \frac{1}{Z(\alpha)} \prod_x p_x^{\alpha_x - 1} \; \delta\!\left( \sum_x p_x - 1 \right)$$
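The "can show" step is short enough to write out. The sketch below treats a single motif position and drops the position index i for readability; it relies only on the multinomial likelihood of the counts $C_x$ and the standard mean of a Dirichlet distribution.

```latex
\begin{align*}
P(p \mid D) &\propto P(D \mid p)\, D(p \mid \alpha)
  \;\propto\; \prod_x p_x^{C_x} \cdot \prod_x p_x^{\alpha_x - 1}
  \;=\; \prod_x p_x^{C_x + \alpha_x - 1} \\
\Rightarrow\quad P(p \mid D) &= D(p \mid C + \alpha) \\
p_x^{\text{PME}} &= \int p_x \, P(p \mid D)\, dp
  \;=\; \frac{C_x + \alpha_x}{\sum_{x'} \left( C_{x'} + \alpha_{x'} \right)}
\end{align*}
```

The posterior is again a Dirichlet, with the counts added to the prior parameters, so its mean is the pseudocount-smoothed frequency.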

Limitations of PSSMs

• Biological issues:
– Other information needed: chromatin structure, interactions with other factors
– Model may fail to capture sequence features of site

• Learning issues:
– Generative vs. discriminative
– Probability density vs. learning a classifier

REDUCE

• REDUCE = Regulatory element discovery using correlation with expression [Bussemaker et al., 2001]

• Supervised learning (regression) F: X → Y, motif counts → log fold change

$$\{N_{\mu g}\} \to A_g$$

• Linear model

$$A_g = C + \sum_{\mu \in M} F_\mu N_{\mu g}$$

REDUCE

• Normalize: mean 0, variance 1/G (unit norm), where G = #genes

$$a_g = \frac{A_g - \hat{A}}{\sqrt{G \, \mathrm{var}(A)}}, \qquad n_{\mu g} = \frac{N_{\mu g} - \hat{N}_\mu}{\sqrt{G \, \mathrm{var}(N_\mu)}}$$

$$a = \sum_{\mu \in M} f_\mu n_\mu$$

where

$$\begin{pmatrix} a_1 & \cdots & a_G \end{pmatrix} = \begin{pmatrix} f_1 & \cdots & f_\mu \end{pmatrix} \begin{pmatrix} \ddots & & \\ n_{\mu 1} & \cdots & n_{\mu G} \\ & & \ddots \end{pmatrix}$$

REDUCE

• Iteratively select motif with biggest reduction in chi-square

$$\chi^2 = \| a - f_\mu n_\mu \|^2 = \| a - (a \cdot n_\mu) \, n_\mu \|^2 = 1 - (a \cdot n_\mu)^2$$

• Compute residual

$$a' = a - \sum_{\text{selected } \mu} f_\mu n_\mu$$

• Significance:

$$\Delta \chi^2_\mu = (a \cdot n_\mu)^2 = \frac{Z_\mu^2}{G}, \qquad Z_\mu \sim N(0, 1)$$
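The greedy selection loop can be sketched as follows. This is a simplified illustration, not the published REDUCE implementation: it fits each motif to the current residual one at a time, without the joint refit of all selected motifs or the significance test, and every function and variable name is an assumption. Inputs are assumed non-constant so the normalization is defined.

```python
import math

def normalize(values):
    """Center to mean 0 and scale to unit Euclidean norm."""
    mean = sum(values) / len(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))  # equals sqrt(G * var)
    return [c / norm for c in centered]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def reduce_greedy(log_ratios, motif_counts, n_steps):
    """Pick motifs one at a time by largest (residual . n_mu)^2.

    log_ratios: one expression log-ratio per gene.
    motif_counts: dict motif -> list of per-gene occurrence counts.
    """
    residual = normalize(log_ratios)
    remaining = {m: normalize(c) for m, c in motif_counts.items()}
    selected = []
    for _ in range(min(n_steps, len(remaining))):
        # Largest squared projection = biggest drop in chi-square.
        best = max(remaining, key=lambda m: dot(residual, remaining[m]) ** 2)
        n_best = remaining.pop(best)
        f = dot(residual, n_best)  # fitted coefficient f_mu
        residual = [r - f * n for r, n in zip(residual, n_best)]
        selected.append((best, f))
    return selected, residual
```

After the first step the residual has squared norm 1 - f², matching the chi-square identity for unit-norm vectors. Later steps work on a residual that is no longer unit norm, which is part of why this is only a sketch.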


REDUCE

MEME

• Given cluster of genes, find overrepresented motif (PSSM)
– Learn location and parameters of PSSM
– $X_{ij}$: k-length window at position j, sequence i
– $Z_{ij}$: indicator (0 or 1) whether window is motif
– $P(Z = 1) = \lambda$: class probability
– (On board, picture of data and variables)

$$\theta = \begin{pmatrix} \theta_0 & \theta_1 \end{pmatrix} = \begin{pmatrix} p_{A0} & p_{A1} & \cdots & p_{Ak} \\ \vdots & \vdots & & \vdots \\ p_{T0} & p_{T1} & \cdots & p_{Tk} \end{pmatrix}$$

Expectation Maximization

• Chicken and egg problem:
– If we knew $Z_{ij}$, could estimate parameters θ and λ
– If we knew parameters θ and λ, we could compute the probability that $Z_{ij} = 1$

• Idea:
– Replace $Z_{ij}$ by its expectation $P(Z_{ij} = 1 \mid X_{ij}, \theta, \lambda)$
– Iteratively recompute estimates for $Z_{ij}$, then θ and λ

EM algorithm

• Initialize parameters θ and λ
• E-step:
– Compute expected values of missing information, given current parameters and the data

$$Z_{ij}^{(t)} \leftarrow P(Z_{ij} = 1 \mid X_{ij}, \theta^{(t)}, \lambda^{(t)}) = \frac{P(X_{ij} \mid \theta_1^{(t)}) \, \lambda^{(t)}}{P(X_{ij} \mid \theta_0^{(t)}) \, (1 - \lambda^{(t)}) + P(X_{ij} \mid \theta_1^{(t)}) \, \lambda^{(t)}}$$

EM algorithm

• M-step:
– Estimate parameters using expected counts, based on current $Z_{ij}$

$$\theta^{(t+1)}: \quad p^l_x = \frac{C^l_x + \alpha^l_x}{\sum_{x'} \left( C^l_{x'} + \alpha^l_{x'} \right)}$$

$$C^l_x = \sum_i \sum_j Z_{ij}^{(t)} \, I_x(i, j + l - 1)$$

$$\lambda^{(t+1)} = \frac{1}{nm} \sum_{i=1..n} \sum_{j=1..m} Z_{ij}^{(t)}$$

Here $I_x(i, j + l - 1)$ is 1 if sequence i has base x at position j + l − 1, and 0 otherwise.

Nucleosome positioning

• “Chromatin decouples promoter threshold from dynamic range”, Lam et al. Nature 2008

• Suggests affinity of TF site in nucleosome-free region determines level of physiological stimulus required for activation

Cis Regulatory Modules

• Homotypic + heterotypic clusters of TF binding sites → functional regulatory sequence

[Gupta & Liu, PNAS 2005]

Bayesian networks

• Learn conditional (in)dependencies between mRNA expression levels of genes

• (Draw example, joint distribution)

• Issues:
– Biological issues: mRNA as proxy for activity
– Interpretation: edges ≠ causal direction
– Statistical and computational problems

Bayesian networks

• Graphical models
– Genes: random variables
– Graph encodes Markov independencies: “every variable independent of its non-descendants, conditioned on its parents”

$$P(X_1, \ldots, X_n) = \prod_{i=1..n} P(X_i \mid \mathrm{Pa}(X_i))$$

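The factorization of the joint distribution into one term per variable given its parents can be illustrated with a three-gene toy network. The network (one regulator `TF` with two targets), the variable names, and all probabilities below are invented for the example; each gene is treated as binary (0 = off, 1 = on).

```python
from itertools import product

# Hypothetical toy network: TF -> G1, TF -> G2.
PARENTS = {"TF": (), "G1": ("TF",), "G2": ("TF",)}

# P(variable = 1 | parent values); made-up numbers for illustration only.
CPT = {
    "TF": {(): 0.3},
    "G1": {(0,): 0.1, (1,): 0.8},
    "G2": {(0,): 0.2, (1,): 0.9},
}

def joint_probability(assignment):
    """Product over variables of P(x_i | parents of x_i)."""
    prob = 1.0
    for var, parents in PARENTS.items():
        p_on = CPT[var][tuple(assignment[p] for p in parents)]
        prob *= p_on if assignment[var] == 1 else 1.0 - p_on
    return prob

def all_assignments():
    """Every 0/1 assignment to the network's variables."""
    names = list(PARENTS)
    for values in product((0, 1), repeat=len(names)):
        yield dict(zip(names, values))
```

Summing `joint_probability` over `all_assignments()` gives 1, and the full joint over 2³ states is specified by only five numbers. That saving is what makes the factored form attractive when there are thousands of genes.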
Structure learning

• Optimize Bayesian score (via heuristic search) over structures S:

$$\log P(S \mid D) = \underbrace{\log P(D \mid S)}_{\text{likelihood}} + \underbrace{\log P(S)}_{\text{structure prior}} + C$$

• Pros:
– Factorizes into local components

• Cons:
– Exponential search space
– Many local maxima
– Undersampled: 1000s of variables, few 100 joint observations

Bayesian networks: example

• Example: ~300 array compendium in yeast, 1 subnetwork similar to mating response

[Figure: subnetwork with nodes SST2, KAR4, TEC1, SLT2, KSS1, YLR343W, YLR334C, STE6, FUS1, PRM1, AGA1, FIG1, FUS3, AGA2, TOM6, YEL059W. Pe’er et al., 2001]

Bayesian networks: example

• Extra technical points
– Model knock-outs as “graph interventions” (modify counts): more causal edges
– Bootstrapping to estimate confident features (like edges)
– Statistical test to score subnetworks of edges


Some recent models

(Beer et al, 2004)

(Segal et al, 2003)