Computational methods for studying gene regulatory networks · 2008-05-22
TRANSCRIPT
Computational methods for studying gene regulatory networks
Cornell Weill, May 14
Dr. Christina Leslie
Dual goals
• Overview of computational approaches to deciphering “gene regulatory networks”
• Introduce basic statistical/machine learning concepts:
– Supervised vs. unsupervised learning
– Probabilistic models: maximum likelihood estimation (MLE), Bayesian methods
– Training vs. test data, cross-validation
– Generalization vs. overfitting: how to measure statistical performance
Levels of gene regulation
• Binding of regulatory proteins to a gene’s control region
• Chemical and structural modification of DNA and chromatin
Alberts et al., MBC, Figure 7-5
Genome-wide expression data
• mRNA expression (microarrays)
– cDNA vs. oligonucleotide arrays
Finer-resolution array data
• Tiling arrays
• Exon-scanning and exon–exon junction arrays
• Also: array CGH (genomic copy number variation), microRNA arrays
Shoemaker et al. 2001
Pyrosequencing
• Solexa, 454, MPSS: expression data based on massively parallel sequencing
• Detect low transcript levels, ncRNAs
ChIP-chip
• Genome occupancy maps for
– Transcription factors
– Pol II
– Chromatin regulators
– Histone modifications
ENCODE pilot project
Transcriptional regulation
• “Very complex” in eukaryotes, in particular in metazoans
• Most computational work is on the yeast S. cerevisiae
Noise in expression data
• Estimate p-values from replicate noise
• (On board: multiple comparison issues)
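The multiple-comparison issue noted above is commonly handled with a false discovery rate procedure; below is a minimal sketch of Benjamini–Hochberg step-up correction, with made-up p-values for illustration (this specific procedure is my addition, not something the slide prescribes):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of p-values significant at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # BH step-up: find the largest k with p_(k) <= (k/m) * alpha
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = ranked <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing the threshold
        significant[order[: k + 1]] = True  # call everything up to that rank
    return significant

# Invented per-gene p-values from a replicate-noise test
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
mask = benjamini_hochberg(pvals, alpha=0.05)
```

With these numbers only the two smallest p-values survive at FDR 0.05, even though five of them are below 0.05 individually.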
Clustering
• Find co-expressed genes (“clusters”)
– Motif discovery algorithms for cis-regulatory elements (more later)
– Enrichment of promoter sequences for known DNA motifs
– Functional annotations
• Problems
– Co-expression ≠ co-regulation
– Static cluster assumption
– Statistical robustness of clusters?
ML background
• Supervised learning
– Labeled training data: (x_i, y_i), i = 1…m, e.g. expression profile x_i, tumor label y_i
– Learn a prediction function f: X → Y that can be used to make predictions on new test data
• Unsupervised learning
– Unlabeled data: x_i, i = 1…m
– E.g. clustering, dimensionality reduction
• Density estimation
– Estimate a probabilistic model P(x|Θ)
– Used in supervised (e.g. P(y|x,Θ)) or unsupervised settings
Example: k-means clustering
• Next slides: k-means algorithm
• Example of unsupervised learning
• Analysis:
– Objective function, convergence
– Dependence on initialization
– Choice of k: overfitting, use of a stability measure
Example: k-means clustering
• Given expression profiles $x_g,\ g = 1 \ldots M$, assign them to $K$ clusters with centroids $\mu_j,\ j = 1 \ldots K$
• Notation: indicator variables

$$Z_g^j = \begin{cases} 1, & \text{if gene } g \text{ is currently assigned to cluster } j \\ 0, & \text{otherwise} \end{cases}$$
K-means
• Initialize: choose K random examples as centroids
• Repeat until assignments do not change:

$$j^* = \arg\min_j \|x_g - \mu_j\|^2, \qquad Z_g^j = \begin{cases} 1, & \text{if } j = j^* \\ 0, & \text{otherwise} \end{cases}$$

$$\mu_j \leftarrow \frac{\sum_g Z_g^j x_g}{\sum_g Z_g^j}, \qquad j = 1 \ldots K$$
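The k-means updates on this slide can be sketched directly in code; this is a minimal NumPy implementation of Lloyd's algorithm with random-example initialization, run on synthetic two-group data:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm matching the slide's updates. X: (M, d) array of
    expression profiles; returns (cluster assignments z, centroids mu)."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    mu = X[rng.choice(M, size=K, replace=False)].copy()  # K random examples
    z = np.full(M, -1)
    for _ in range(n_iter):
        # Assignment step: z_g = argmin_j ||x_g - mu_j||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = d2.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments unchanged: converged
        z = z_new
        # Update step: mu_j = mean of the examples assigned to cluster j
        for j in range(K):
            if (z == j).any():
                mu[j] = X[z == j].mean(axis=0)
    return z, mu

# Synthetic data: two well-separated groups of "expression profiles"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
z, mu = kmeans(X, K=2)
```

On well-separated data like this, the algorithm recovers the two groups; with less separation, the result depends on initialization, as the next slide notes.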
Some analysis
• Objective function: distortion measure

$$J = \sum_{j=1}^{K} \sum_{g=1}^{M} Z_g^j \|x_g - \mu_j\|^2$$

• Each iteration decreases J; converges to a local minimum in a finite number of steps; depends on initialization
• Choice of K?
– Larger K reduces distortion, even on held-out data
– Stability: subsample twice, measure similarity on the overlap in terms of pairs of examples in the same cluster
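The stability heuristic for choosing K (subsample twice, compare pair co-membership on the overlap) can be sketched as follows; `pair_agreement` is a hypothetical helper name and the two clusterings are hard-coded for illustration:

```python
from itertools import combinations

def pair_agreement(z1, z2, shared):
    """Fraction of pairs of shared examples on which two clusterings agree:
    both put the pair in the same cluster, or both split it."""
    agree = total = 0
    for a, b in combinations(shared, 2):
        total += 1
        if (z1[a] == z1[b]) == (z2[a] == z2[b]):
            agree += 1
    return agree / total

# Two subsample clusterings (example index -> cluster label); labels may be
# permuted between runs, but pair co-membership is label-invariant.
z1 = {0: 0, 1: 0, 2: 1, 3: 1}
z2 = {0: 1, 1: 1, 2: 0, 3: 0}
score = pair_agreement(z1, z2, shared=[0, 1, 2, 3])  # perfectly stable
```

A value of K whose clusterings score consistently high under repeated subsampling is considered stable; note the score compares co-membership, so it is insensitive to cluster relabeling between runs.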
DNA motifs
• E.g. TF binding sites
• k-mers vs. PSSM (position-specific scoring matrix)
• E.g. data for the yeast Rox1 motif
• Use data D to estimate a generative model, P(M_site | D, θ)
TGTATTGTT
TCTATTGTT
TCAATTGTT
TGCTTTGTT
CCCATTGTT
CCGATTGTT
CGCATTGTT
CCTATTGTG
GCTATTGTT
TTTATTGTT
GGCATTGTT
CCTATTGTT
TCCATTGTT
CTCATTGTT
TTCATTGTT
CCTATTGTT
CGTATTGTC

Consensus: YSYATTGTT
PSSMs
• Motif model M_site
• Background model M_null
• Log-odds score

$$P(b_1 b_2 \cdots b_k \mid M_{site}, \theta) = p_{b_1,1}\, p_{b_2,2} \cdots p_{b_k,k}, \qquad \theta = \{p_{x,i}\}$$

$$P(b_1 b_2 \cdots b_k \mid M_{null}, \theta_0) = q_{b_1} q_{b_2} \cdots q_{b_k}, \qquad \theta_0 = \{q_x\}$$

$$l(b) = \log \frac{P(b \mid M_{site})}{P(b \mid M_{null})} = \sum_{i=1}^{k} \log\!\left(\frac{p_{b_i,i}}{q_{b_i}}\right)$$
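The log-odds score above can be computed directly from a PSSM; the 3-column matrix and background q = 0.25 below are invented for illustration (not the Rox1 model):

```python
import math

# Invented 3-column PSSM (base -> per-position probabilities) and a uniform
# background q = 0.25; illustrative only, not an estimated motif model.
pssm = {
    "A": [0.1, 0.7, 0.1],
    "C": [0.1, 0.1, 0.1],
    "G": [0.7, 0.1, 0.1],
    "T": [0.1, 0.1, 0.7],
}
q = 0.25

def log_odds(site):
    """l(b) = sum_i log(p_{b_i, i} / q_{b_i}), as on the slide."""
    return sum(math.log(pssm[b][i] / q) for i, b in enumerate(site))

score_consensus = log_odds("GAT")  # matches the high-probability bases
score_random = log_odds("CCC")     # mismatches every position
```

Sites resembling the motif score positive (more likely under M_site than M_null), and mismatched sites score negative, which is what makes the log-odds score usable as a scanning threshold.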
MLE
• Estimate parameters from data D: maximize the likelihood function

$$\max_\theta \log P(D \mid M_{site}, \theta) = \max_\theta \sum_{b \in D} \log P(b \mid M_{site}, \theta)$$

$$\Rightarrow \quad p_{x,i} = \frac{C_{x,i}}{\sum_x C_{x,i}}$$

• Maximum likelihood estimate = sample frequency
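Since the MLE is just the sample frequency, estimating a PSSM reduces to column-wise counting; the aligned sites below are invented, not real Rox1 data:

```python
from collections import Counter

# A few aligned (invented) binding sites; columns are motif positions
sites = ["ATTGT", "ATTGT", "CTTGT", "ATAGT"]
k = len(sites[0])

# MLE: p_{x,i} = C_{x,i} / sum_x C_{x,i}, i.e. the column-wise base frequency
pssm = []
for i in range(k):
    counts = Counter(s[i] for s in sites)
    n = sum(counts.values())
    pssm.append({b: counts.get(b, 0) / n for b in "ACGT"})
```

Note that any base unseen in a column gets probability exactly zero, which is the problem the pseudocount/Bayesian slides later address.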
Results on test data
Donor splice site (5’) motif
Enrichment in a cluster
• Given a set/cluster of genes, are promoter sequences enriched for a (known) TF binding site?
– Z-score: treat each (overlapping) k-mer window as a binomial trial, with p_0 = background frequency

$$Z = \frac{C - N p_0}{\sqrt{N p_0 (1 - p_0)}}$$

Score for C occurrences of the motif in N windows
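A minimal sketch of this Z-score (numbers invented; as the slide hints, the binomial independence assumption is only approximate because windows overlap):

```python
import math

def motif_zscore(C, N, p0):
    """Z = (C - N p0) / sqrt(N p0 (1 - p0)) for C motif occurrences in N
    k-mer windows with background per-window probability p0."""
    return (C - N * p0) / math.sqrt(N * p0 * (1 - p0))

# Invented numbers: 30 occurrences in 1000 windows, background rate 1.5%
z = motif_zscore(C=30, N=1000, p0=0.015)  # expected count is only 15
```

A Z-score near 4, as here, would usually be called significant enrichment under the normal approximation, but the overlapping-window correlation argues for treating it as a rough screen rather than an exact p-value.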
Enrichment in a cluster
• Other enrichment statistics
– Hypergeometric p-value: probability of obtaining at least k promoters with the motif

$$\sum_{i=k}^{n} \frac{\binom{m}{i} \binom{N-m}{n-i}}{\binom{N}{n}}$$

N = total # genes, m = total # genes with motif, n = size of cluster
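This tail sum can be computed exactly with Python's `math.comb`; the gene counts below are invented:

```python
from math import comb

def hypergeom_pval(N, m, n, k):
    """P(at least k of the n cluster genes carry the motif): N total genes,
    m genes with the motif, n = cluster size (the slide's notation)."""
    return sum(comb(m, i) * comb(N - m, n - i)
               for i in range(k, min(m, n) + 1)) / comb(N, n)

# Invented numbers: 6000 genes, 300 with the motif, cluster of 50 with 8 hits
p = hypergeom_pval(N=6000, m=300, n=50, k=8)  # expected hits: only 2.5
```

Because the counts are exact integers, this avoids the normal approximation of the Z-score; in practice one would still correct the resulting p-values for testing many motifs and many clusters.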
Bayesian statistics
• Use of pseudocounts for smoothing
• Bayesian estimates:
– Bayes rule
– Posterior mean estimate
– Dirichlet prior
Pseudocounts and Priors
• Smooth the MLE with pseudocounts

$$p_{x,i} = \frac{C_{x,i} + \alpha_{x,i}}{\sum_x \left(C_{x,i} + \alpha_{x,i}\right)}$$

• Posterior probability

$$P(\theta \mid D, M) = \frac{\overbrace{P(D \mid \theta, M)}^{\text{likelihood}}\ \overbrace{P(\theta \mid M)}^{\text{prior}}}{\underbrace{P(D \mid M)}_{\text{evidence}}}$$
Pseudocounts and priors
• Posterior mean estimate:

$$\theta_{PME} = \int \theta\, P(\theta \mid D)\, d\theta$$

• Can show that for a Dirichlet prior with parameters $\{\alpha_{x,i}\}$,

$$\theta_{PME} = \text{smoothing with pseudocounts}$$

$$D(p \mid \alpha) = \frac{1}{Z(\alpha)} \prod_x p_x^{\alpha_x - 1}\, \delta\!\left(\sum_x p_x - 1\right)$$
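Numerically, the posterior mean under a Dirichlet prior is just the pseudocount-smoothed frequency; the counts below are invented for one PSSM column:

```python
# Column counts for one PSSM position (invented) plus Dirichlet pseudocounts;
# the posterior mean estimate equals the pseudocount-smoothed frequency.
counts = {"A": 7, "C": 0, "G": 1, "T": 2}
alpha = {b: 0.5 for b in "ACGT"}  # symmetric Dirichlet(1/2) prior

total = sum(counts[b] + alpha[b] for b in "ACGT")
p_pme = {b: (counts[b] + alpha[b]) / total for b in "ACGT"}

# The MLE would assign C probability exactly 0; the posterior mean does not,
# so an unseen base no longer makes a candidate site impossible.
```

This is why smoothed PSSMs never assign log-odds of minus infinity to a site containing a base unobserved in the training alignment.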
Limitations of PSSMs
• Biological issues:
– Other information needed: chromatin structure, interactions with other factors
– Model may fail to capture sequence features of the site
• Learning issues:
– Generative vs. discriminative
– Probability density vs. learning a classifier
REDUCE
• REDUCE = Regulatory element discovery using correlation with expression [Bussemaker et al., 2001]
• Supervised learning (regression) F: X → Y, motif counts {N_μg} → log fold change A_g
• Linear model

$$A_g = C + \sum_{\mu \in M} F_\mu N_{\mu g}$$
REDUCE
• Normalize to mean 0 (G = # genes), so the model becomes

$$a = \sum_{\mu \in M} f_\mu n_\mu, \qquad (a_1\ \cdots\ a_G) = (f_1\ \cdots\ f_\mu\ \cdots)\begin{pmatrix} \ddots & & \\ n_{\mu,1} & \cdots & n_{\mu,G} \\ & & \ddots \end{pmatrix}$$

where

$$a_g = \frac{A_g - \hat{A}}{\sqrt{G\,\operatorname{var}(A)}}, \qquad n_{\mu g} = \frac{N_{\mu g} - \hat{N}_\mu}{\sqrt{G\,\operatorname{var}(N_\mu)}}$$
REDUCE
• Iteratively select the motif with the biggest reduction in chi-square

$$\chi^2 = \left\| a - f_\mu n_\mu \right\|^2 = \left\| a - (a \cdot n_\mu)\, n_\mu \right\|^2 = 1 - (a \cdot n_\mu)^2$$

• Compute the residual

$$a' = a - \sum_{\mu\ (\text{selected})} f_\mu n_\mu$$

• Significance:

$$\Delta\chi^2_\mu = (a \cdot n_\mu)^2 = \frac{Z_\mu^2}{G}, \qquad Z_\mu \sim N(0,1)$$
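The normalization and greedy selection steps can be sketched together; `reduce_select` is a hypothetical name, the data is simulated, and this is only a sketch in the spirit of the published algorithm:

```python
import numpy as np

def reduce_select(A, Nmat, n_motifs=1):
    """Sketch of REDUCE's greedy step: normalize the expression vector a and
    each motif-count row n_mu to mean 0 and unit norm (dividing by
    sqrt(G * var)), then repeatedly pick the motif maximizing
    delta-chi^2 = (a . n_mu)^2 and subtract its fitted term f_mu * n_mu."""
    G = A.size
    a = (A - A.mean()) / np.sqrt(G * A.var())
    n = Nmat - Nmat.mean(axis=1, keepdims=True)
    n = n / np.sqrt(G * Nmat.var(axis=1, keepdims=True))
    selected = []
    for _ in range(n_motifs):
        scores = (n @ a) ** 2            # delta chi^2 per motif
        mu = int(np.argmax(scores))
        a = a - (n[mu] @ a) * n[mu]      # residual a' = a - f_mu n_mu
        selected.append(mu)
    return selected

# Simulated data: expression driven by motif 3's counts plus small noise
rng = np.random.default_rng(0)
Nmat = rng.poisson(2.0, size=(5, 200)).astype(float)  # 5 motifs, 200 genes
A = 3.0 * Nmat[3] + 0.1 * rng.normal(size=200)
sel = reduce_select(A, Nmat, n_motifs=1)  # should recover motif 3
```

With the unit-norm normalization, each (a · n_μ)² is a squared correlation, which is what makes the Z_μ significance test on the previous slide possible.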
REDUCE
MEME
• Given a cluster of genes, find an overrepresented motif (PSSM)
– Learn the location and parameters of the PSSM
• X_ij: k-length window at position j in sequence i
• Z_ij: indicator (0 or 1) of whether the window is a motif
• λ: class probability
• (On board: picture of data and variables)

$$\theta = (\theta_0\ \ \theta_1) = \begin{pmatrix} p_{A,0} & p_{A,1} & \cdots & p_{A,k} \\ \vdots & \vdots & & \vdots \\ p_{T,0} & p_{T,1} & \cdots & p_{T,k} \end{pmatrix}$$

$$P(Z = 1) = \lambda$$
Expectation Maximization
• Chicken-and-egg problem:
– If we knew Z_ij, we could estimate the parameters θ and λ
– If we knew the parameters θ and λ, we could compute the probability that Z_ij = 1:

$$P(Z_{ij} = 1 \mid X_{ij}, \theta, \lambda)$$

• Idea:
– Replace Z_ij by its expectation
– Iteratively recompute estimates for Z_ij, then θ and λ
EM algorithm
• Initialize parameters θ and λ
• E-step:
– Compute expected values of the missing information, given the current parameters and the data

$$Z_{ij}^{(t)} \leftarrow P(Z_{ij} = 1 \mid X_{ij}, \theta^{(t)}, \lambda^{(t)}) = \frac{P(X_{ij} \mid \theta_1^{(t)})\, \lambda^{(t)}}{P(X_{ij} \mid \theta_0^{(t)})\,(1 - \lambda^{(t)}) + P(X_{ij} \mid \theta_1^{(t)})\, \lambda^{(t)}}$$
EM algorithm
• M-step:
– Estimate parameters using expected counts, based on the current Z_ij

$$\theta^{(t+1)}:\quad p_{x,l} = \frac{C_{x,l} + \beta_{x,l}}{\sum_x \left(C_{x,l} + \beta_{x,l}\right)}, \qquad C_{x,l} = \sum_i \sum_j Z_{ij}^{(t)}\, I_x(i,\ j + l - 1)$$

$$\lambda^{(t+1)} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} Z_{ij}^{(t)}$$
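The E- and M-steps can be sketched for a simplified version of this model, treating each fixed-length window as either motif or background with pseudocounts β in the M-step; the window list, initialization, and helper names (`one_hot`, `em_step`) are my own invented illustration, not MEME itself:

```python
import numpy as np

BASES = "ACGT"

def one_hot(windows):
    """(n_windows, k, 4) indicator array over bases, one row per window."""
    X = np.zeros((len(windows), len(windows[0]), 4))
    for i, w in enumerate(windows):
        for l, b in enumerate(w):
            X[i, l, BASES.index(b)] = 1.0
    return X

def em_step(X, theta1, theta0, lam, beta=0.5):
    """One EM iteration for the two-class (motif vs. background) model."""
    # E-step: Z_i = P(window i is a motif | X, theta, lambda)
    p1 = lam * np.prod((theta1[None] ** X).reshape(len(X), -1), axis=1)
    p0 = (1 - lam) * np.prod((theta0[None, None] ** X).reshape(len(X), -1), axis=1)
    Z = p1 / (p1 + p0)
    # M-step: expected counts plus pseudocounts beta, then renormalize
    C1 = (Z[:, None, None] * X).sum(axis=0) + beta          # motif columns
    theta1 = C1 / C1.sum(axis=1, keepdims=True)
    C0 = ((1 - Z)[:, None, None] * X).sum(axis=(0, 1)) + beta  # background
    theta0 = C0 / C0.sum()
    lam = Z.mean()
    return theta1, theta0, lam, Z

# Invented windows: six identical "motif" windows plus background-like ones
rng = np.random.default_rng(0)
windows = ["ATTGT"] * 6 + ["GCACG", "CCGAG", "GACCG", "AGGCC"]
X = one_hot(windows)
theta1 = rng.dirichlet(np.ones(4), size=X.shape[1])  # random init breaks symmetry
theta0 = np.full(4, 0.25)
lam = 0.5
for _ in range(20):
    theta1, theta0, lam, Z = em_step(X, theta1, theta0, lam)
```

One detail worth noting: if θ1 and θ0 are both initialized uniform, the E-step gives Z = 1/2 everywhere and EM is stuck at a symmetric fixed point, which is why the motif model is initialized randomly here.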
Nucleosome positioning
• “Chromatin decouples promoter threshold from dynamic range”, Lam et al., Nature 2008
• Suggests that the affinity of a TF site in the nucleosome-free region determines the level of physiological stimulus required for activation
Cis-regulatory modules
• Homotypic + heterotypic clusters of TF binding sites → functional regulatory sequence
[Gupta & Liu, PNAS 2005]
Bayesian networks
• Learn conditional (in)dependencies between mRNA expression levels of genes
• (Draw example, joint distribution)
• Issues:
– Biological: mRNA as a proxy for activity
– Interpretation: edges ≠ causal direction
– Statistical and computational problems
Bayesian networks
• Graphical models
– Genes: random variables
– Graph encodes Markov independencies: “every variable is independent of its non-descendants, conditioned on its parents”

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$
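The factorization can be checked on a toy three-gene chain X1 → X2 → X3, with binary on/off states and invented probabilities:

```python
# Toy three-gene network X1 -> X2 -> X3 (binary on/off states); all numbers
# are invented. Per the factorization, the joint is
# P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X2).
P_x1 = {1: 0.3, 0: 0.7}
P_x2_given_x1 = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}  # outer key: x1
P_x3_given_x2 = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}  # outer key: x2

def joint(x1, x2, x3):
    return P_x1[x1] * P_x2_given_x1[x1][x2] * P_x3_given_x2[x2][x3]

# The factorized product defines a valid joint distribution (sums to 1),
# using 2 + 4 + 4 local parameters instead of 2^3 joint entries.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

The parameter saving is the point: with n genes and bounded parent sets, the factorization replaces an exponential joint table with a product of small local conditionals.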
Structure learning
• Optimize a Bayesian score (via heuristic search) over structures S:

$$\log P(S \mid D) = \underbrace{\log P(D \mid S)}_{\text{likelihood}} + \underbrace{\log P(S)}_{\text{structure prior}} + C$$

• Pros:
– Factorizes into local components
• Cons:
– Exponential search space
– Many local maxima
– Undersampled: 1000s of variables, only a few hundred joint observations
Bayesian networks: example
• Example: ~300-array compendium in yeast; one subnetwork similar to the mating response
[Network figure: genes SST2, KAR4, TEC1, SLT2, KSS1, YLR343W, YLR334C, STE6, FUS1, PRM1, AGA1, FIG1, FUS3, AGA2, TOM6, YEL059W (Pe'er et al., 2001)]
Bayesian networks: example
• Extra technical points
– Model knock-outs as “graph interventions” (modify counts): more causal edges
– Bootstrapping to estimate confident features (like edges)
– Statistical test to score subnetworks of edges
Some recent models
(Beer et al., 2004)
(Segal et al., 2003)