Computational methods for studying gene regulatory networks · 2008-05-22
TRANSCRIPT
Computational methods for studying gene regulatory networks
Cornell Weill, May 14
Dr. Christina Leslie
Dual goals
• Overview of computational approaches to deciphering “gene regulatory networks”
• Introduce basic statistical/machine learning concepts:
– Supervised vs. unsupervised learning
– Probabilistic models: maximum likelihood estimation (MLE), Bayesian methods
– Training vs. test data, cross-validation
– Generalization vs. overfitting: how to measure statistical performance
Levels of gene regulation
• Binding of regulatory proteins to a gene’s control region
• Chemical and structural modification of DNA and chromatin
Alberts et al., MBC, Figure 7-5
Genome-wide expression data
• mRNA expression (microarrays)
– cDNA vs. oligonucleotide arrays
Finer-resolution array data
• Tiling arrays
• Exon-scanning and exon–exon junction arrays
• Also: array CGH (genomic copy number variation), microRNA arrays
Shoemaker et al. 2001
Pyrosequencing
• Solexa, 454, MPSS: expression data based on massively parallel sequencing
• Detect low transcript levels, ncRNAs
ChIP-chip
• Genome occupancy maps for
– Transcription factors
– Pol II
– Chromatin regulators
– Histone modifications
ENCODE pilot project
Transcriptional regulation
• “Very complex” in eukaryotes, in particular in metazoans
• Most computational work is on the yeast S. cerevisiae
Noise in expression data
• Estimate p-values from replicate noise
• (On board: multiple comparison issues)
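The multiple-comparison issue noted above is commonly handled with a false discovery rate procedure; below is a minimal sketch of Benjamini–Hochberg step-up correction, with made-up p-values for illustration (this specific procedure is my addition, not something the slide prescribes):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of p-values significant at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # BH step-up: find the largest k with p_(k) <= (k/m) * alpha
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = ranked <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank passing the threshold
        significant[order[: k + 1]] = True  # call everything up to that rank
    return significant

# Invented per-gene p-values from a replicate-noise test
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
mask = benjamini_hochberg(pvals, alpha=0.05)
```

With these numbers only the two smallest p-values survive at FDR 0.05, even though five of them are below 0.05 individually.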
Clustering
• Find co-expressed genes (“clusters”)
– Motif discovery algorithms for cis-regulatory elements (more later)
– Enrichment of promoter sequences for known DNA motifs
– Functional annotations
• Problems
– Co-expression ≠ co-regulation
– Static cluster assumption
– Statistical robustness of clusters?
ML background
• Supervised learning
– Labeled training data: (x_i, y_i), i = 1…m, e.g. expression profile x_i, tumor label y_i
– Learn a prediction function f: X → Y that can be used to make predictions on new test data
• Unsupervised learning
– Unlabeled data: x_i, i = 1…m
– E.g. clustering, dimensionality reduction
• Density estimation
– Estimate a probabilistic model P(x|Θ)
– Used in supervised (e.g. P(y|x,Θ)) or unsupervised settings
Example: k-means clustering
• Next slides: k-means algorithm
• Example of unsupervised learning
• Analysis:
– Objective function, convergence
– Dependence on initialization
– Choice of k: overfitting, use of a stability measure
Example: k-means clustering
• Given expression profiles $x_g,\ g = 1 \ldots M$, assign them to $K$ clusters with centroids $\mu_j,\ j = 1 \ldots K$
• Notation: indicator variables

$$Z_g^j = \begin{cases} 1, & \text{if gene } g \text{ is currently assigned to cluster } j \\ 0, & \text{otherwise} \end{cases}$$
K-means
• Initialize: choose K random examples as centroids
• Repeat until assignments do not change:

$$j^* = \arg\min_j \|x_g - \mu_j\|^2, \qquad Z_g^j = \begin{cases} 1, & \text{if } j = j^* \\ 0, & \text{otherwise} \end{cases}$$

$$\mu_j \leftarrow \frac{\sum_g Z_g^j x_g}{\sum_g Z_g^j}, \qquad j = 1 \ldots K$$
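The k-means updates on this slide can be sketched directly in code; this is a minimal NumPy implementation of Lloyd's algorithm with random-example initialization, run on synthetic two-group data:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm matching the slide's updates. X: (M, d) array of
    expression profiles; returns (cluster assignments z, centroids mu)."""
    rng = np.random.default_rng(seed)
    M = X.shape[0]
    mu = X[rng.choice(M, size=K, replace=False)].copy()  # K random examples
    z = np.full(M, -1)
    for _ in range(n_iter):
        # Assignment step: z_g = argmin_j ||x_g - mu_j||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        z_new = d2.argmin(axis=1)
        if np.array_equal(z_new, z):
            break  # assignments unchanged: converged
        z = z_new
        # Update step: mu_j = mean of the examples assigned to cluster j
        for j in range(K):
            if (z == j).any():
                mu[j] = X[z == j].mean(axis=0)
    return z, mu

# Synthetic data: two well-separated groups of "expression profiles"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
z, mu = kmeans(X, K=2)
```

On well-separated data like this, the algorithm recovers the two groups; with less separation, the result depends on initialization, as the next slide notes.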
Some analysis
• Objective function: distortion measure

$$J = \sum_{j=1}^{K} \sum_{g=1}^{M} Z_g^j \|x_g - \mu_j\|^2$$

• Each iteration decreases J; converges to a local minimum in a finite number of steps; depends on initialization
• Choice of K?
– Larger K reduces distortion, even on held-out data
– Stability: subsample twice, measure similarity on the overlap in terms of pairs of examples in the same cluster
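The stability heuristic for choosing K (subsample twice, compare pair co-membership on the overlap) can be sketched as follows; `pair_agreement` is a hypothetical helper name and the two clusterings are hard-coded for illustration:

```python
from itertools import combinations

def pair_agreement(z1, z2, shared):
    """Fraction of pairs of shared examples on which two clusterings agree:
    both put the pair in the same cluster, or both split it."""
    agree = total = 0
    for a, b in combinations(shared, 2):
        total += 1
        if (z1[a] == z1[b]) == (z2[a] == z2[b]):
            agree += 1
    return agree / total

# Two subsample clusterings (example index -> cluster label); labels may be
# permuted between runs, but pair co-membership is label-invariant.
z1 = {0: 0, 1: 0, 2: 1, 3: 1}
z2 = {0: 1, 1: 1, 2: 0, 3: 0}
score = pair_agreement(z1, z2, shared=[0, 1, 2, 3])  # perfectly stable
```

A value of K whose clusterings score consistently high under repeated subsampling is considered stable; note the score compares co-membership, so it is insensitive to cluster relabeling between runs.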
DNA motifs
• E.g. TF binding sites
• k-mers vs. PSSM (position-specific scoring matrix)
• E.g. data for the yeast Rox1 motif
• Use data D to estimate a generative model, P(M_site | D, θ)
TGTATTGTT
TCTATTGTT
TCAATTGTT
TGCTTTGTT
CCCATTGTT
CCGATTGTT
CGCATTGTT
CCTATTGTG
GCTATTGTT
TTTATTGTT
GGCATTGTT
CCTATTGTT
TCCATTGTT
CTCATTGTT
TTCATTGTT
CCTATTGTT
CGTATTGTC

Consensus: YSYATTGTT
PSSMs
• Motif model M_site
• Background model M_null
• Log-odds score

$$P(b_1 b_2 \cdots b_k \mid M_{site}, \theta) = p_{b_1,1}\, p_{b_2,2} \cdots p_{b_k,k}, \qquad \theta = \{p_{x,i}\}$$

$$P(b_1 b_2 \cdots b_k \mid M_{null}, \theta_0) = q_{b_1} q_{b_2} \cdots q_{b_k}, \qquad \theta_0 = \{q_x\}$$

$$l(b) = \log \frac{P(b \mid M_{site})}{P(b \mid M_{null})} = \sum_{i=1}^{k} \log\!\left(\frac{p_{b_i,i}}{q_{b_i}}\right)$$
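The log-odds score above can be computed directly from a PSSM; the 3-column matrix and background q = 0.25 below are invented for illustration (not the Rox1 model):

```python
import math

# Invented 3-column PSSM (base -> per-position probabilities) and a uniform
# background q = 0.25; illustrative only, not an estimated motif model.
pssm = {
    "A": [0.1, 0.7, 0.1],
    "C": [0.1, 0.1, 0.1],
    "G": [0.7, 0.1, 0.1],
    "T": [0.1, 0.1, 0.7],
}
q = 0.25

def log_odds(site):
    """l(b) = sum_i log(p_{b_i, i} / q_{b_i}), as on the slide."""
    return sum(math.log(pssm[b][i] / q) for i, b in enumerate(site))

score_consensus = log_odds("GAT")  # matches the high-probability bases
score_random = log_odds("CCC")     # mismatches every position
```

Sites resembling the motif score positive (more likely under M_site than M_null), and mismatched sites score negative, which is what makes the log-odds score usable as a scanning threshold.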
MLE
• Estimate parameters from data D: maximize the likelihood function

$$\max_\theta \log P(D \mid M_{site}, \theta) = \max_\theta \sum_{b \in D} \log P(b \mid M_{site}, \theta)$$

$$\Rightarrow \quad p_{x,i} = \frac{C_{x,i}}{\sum_x C_{x,i}}$$

• Maximum likelihood estimate = sample frequency
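Since the MLE is just the sample frequency, estimating a PSSM reduces to column-wise counting; the aligned sites below are invented, not real Rox1 data:

```python
from collections import Counter

# A few aligned (invented) binding sites; columns are motif positions
sites = ["ATTGT", "ATTGT", "CTTGT", "ATAGT"]
k = len(sites[0])

# MLE: p_{x,i} = C_{x,i} / sum_x C_{x,i}, i.e. the column-wise base frequency
pssm = []
for i in range(k):
    counts = Counter(s[i] for s in sites)
    n = sum(counts.values())
    pssm.append({b: counts.get(b, 0) / n for b in "ACGT"})
```

Note that any base unseen in a column gets probability exactly zero, which is the problem the pseudocount/Bayesian slides later address.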
Results on test data
Donor splice site (5’) motif
Enrichment in a cluster
• Given a set/cluster of genes, are promoter sequences enriched for a (known) TF binding site?
– Z-score: treat each (overlapping) k-mer window as a binomial trial, with p_0 = background frequency

$$Z = \frac{C - N p_0}{\sqrt{N p_0 (1 - p_0)}}$$

Score for C occurrences of the motif in N windows
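A minimal sketch of this Z-score (numbers invented; as the slide hints, the binomial independence assumption is only approximate because windows overlap):

```python
import math

def motif_zscore(C, N, p0):
    """Z = (C - N p0) / sqrt(N p0 (1 - p0)) for C motif occurrences in N
    k-mer windows with background per-window probability p0."""
    return (C - N * p0) / math.sqrt(N * p0 * (1 - p0))

# Invented numbers: 30 occurrences in 1000 windows, background rate 1.5%
z = motif_zscore(C=30, N=1000, p0=0.015)  # expected count is only 15
```

A Z-score near 4, as here, would usually be called significant enrichment under the normal approximation, but the overlapping-window correlation argues for treating it as a rough screen rather than an exact p-value.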
Enrichment in a cluster
• Other enrichment statistics
– Hypergeometric p-value: probability of obtaining at least k promoters with the motif

$$\sum_{i=k}^{n} \frac{\binom{m}{i} \binom{N-m}{n-i}}{\binom{N}{n}}$$

N = total # genes, m = total # genes with motif, n = size of cluster
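This tail sum can be computed exactly with Python's `math.comb`; the gene counts below are invented:

```python
from math import comb

def hypergeom_pval(N, m, n, k):
    """P(at least k of the n cluster genes carry the motif): N total genes,
    m genes with the motif, n = cluster size (the slide's notation)."""
    return sum(comb(m, i) * comb(N - m, n - i)
               for i in range(k, min(m, n) + 1)) / comb(N, n)

# Invented numbers: 6000 genes, 300 with the motif, cluster of 50 with 8 hits
p = hypergeom_pval(N=6000, m=300, n=50, k=8)  # expected hits: only 2.5
```

Because the counts are exact integers, this avoids the normal approximation of the Z-score; in practice one would still correct the resulting p-values for testing many motifs and many clusters.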
Bayesian statistics
• Use of pseudocounts for smoothing
• Bayesian estimates:
– Bayes rule
– Posterior mean estimate
– Dirichlet prior
Pseudocounts and Priors
• Smooth the MLE with pseudocounts

$$p_{x,i} = \frac{C_{x,i} + \alpha_{x,i}}{\sum_x \left(C_{x,i} + \alpha_{x,i}\right)}$$

• Posterior probability

$$P(\theta \mid D, M) = \frac{\overbrace{P(D \mid \theta, M)}^{\text{likelihood}}\ \overbrace{P(\theta \mid M)}^{\text{prior}}}{\underbrace{P(D \mid M)}_{\text{evidence}}}$$
Pseudocounts and priors
• Posterior mean estimate:

$$\theta_{PME} = \int \theta\, P(\theta \mid D)\, d\theta$$

• Can show that for a Dirichlet prior with parameters $\{\alpha_{x,i}\}$,

$$\theta_{PME} = \text{smoothing with pseudocounts}$$

$$D(p \mid \alpha) = \frac{1}{Z(\alpha)} \prod_x p_x^{\alpha_x - 1}\, \delta\!\left(\sum_x p_x - 1\right)$$
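Numerically, the posterior mean under a Dirichlet prior is just the pseudocount-smoothed frequency; the counts below are invented for one PSSM column:

```python
# Column counts for one PSSM position (invented) plus Dirichlet pseudocounts;
# the posterior mean estimate equals the pseudocount-smoothed frequency.
counts = {"A": 7, "C": 0, "G": 1, "T": 2}
alpha = {b: 0.5 for b in "ACGT"}  # symmetric Dirichlet(1/2) prior

total = sum(counts[b] + alpha[b] for b in "ACGT")
p_pme = {b: (counts[b] + alpha[b]) / total for b in "ACGT"}

# The MLE would assign C probability exactly 0; the posterior mean does not,
# so an unseen base no longer makes a candidate site impossible.
```

This is why smoothed PSSMs never assign log-odds of minus infinity to a site containing a base unobserved in the training alignment.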
Limitations of PSSMs
• Biological issues:
– Other information needed: chromatin structure, interactions with other factors
– Model may fail to capture sequence features of the site
• Learning issues:
– Generative vs. discriminative
– Probability density vs. learning a classifier
REDUCE
• REDUCE = Regulatory element discovery using correlation with expression [Bussemaker et al., 2001]
• Supervised learning (regression) F: X → Y, motif counts {N_μg} → log fold change A_g
• Linear model

$$A_g = C + \sum_{\mu \in M} F_\mu N_{\mu g}$$
REDUCE
• Normalize to mean 0 (G = # genes), so the model becomes

$$a = \sum_{\mu \in M} f_\mu n_\mu, \qquad (a_1\ \cdots\ a_G) = (f_1\ \cdots\ f_\mu\ \cdots)\begin{pmatrix} \ddots & & \\ n_{\mu,1} & \cdots & n_{\mu,G} \\ & & \ddots \end{pmatrix}$$

where

$$a_g = \frac{A_g - \hat{A}}{\sqrt{G\,\operatorname{var}(A)}}, \qquad n_{\mu g} = \frac{N_{\mu g} - \hat{N}_\mu}{\sqrt{G\,\operatorname{var}(N_\mu)}}$$
REDUCE
• Iteratively select the motif with the biggest reduction in chi-square

$$\chi^2 = \left\| a - f_\mu n_\mu \right\|^2 = \left\| a - (a \cdot n_\mu)\, n_\mu \right\|^2 = 1 - (a \cdot n_\mu)^2$$

• Compute the residual

$$a' = a - \sum_{\mu\ (\text{selected})} f_\mu n_\mu$$

• Significance:

$$\Delta\chi^2_\mu = (a \cdot n_\mu)^2 = \frac{Z_\mu^2}{G}, \qquad Z_\mu \sim N(0,1)$$
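The normalization and greedy selection steps can be sketched together; `reduce_select` is a hypothetical name, the data is simulated, and this is only a sketch in the spirit of the published algorithm:

```python
import numpy as np

def reduce_select(A, Nmat, n_motifs=1):
    """Sketch of REDUCE's greedy step: normalize the expression vector a and
    each motif-count row n_mu to mean 0 and unit norm (dividing by
    sqrt(G * var)), then repeatedly pick the motif maximizing
    delta-chi^2 = (a . n_mu)^2 and subtract its fitted term f_mu * n_mu."""
    G = A.size
    a = (A - A.mean()) / np.sqrt(G * A.var())
    n = Nmat - Nmat.mean(axis=1, keepdims=True)
    n = n / np.sqrt(G * Nmat.var(axis=1, keepdims=True))
    selected = []
    for _ in range(n_motifs):
        scores = (n @ a) ** 2            # delta chi^2 per motif
        mu = int(np.argmax(scores))
        a = a - (n[mu] @ a) * n[mu]      # residual a' = a - f_mu n_mu
        selected.append(mu)
    return selected

# Simulated data: expression driven by motif 3's counts plus small noise
rng = np.random.default_rng(0)
Nmat = rng.poisson(2.0, size=(5, 200)).astype(float)  # 5 motifs, 200 genes
A = 3.0 * Nmat[3] + 0.1 * rng.normal(size=200)
sel = reduce_select(A, Nmat, n_motifs=1)  # should recover motif 3
```

With the unit-norm normalization, each (a · n_μ)² is a squared correlation, which is what makes the Z_μ significance test on the previous slide possible.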
REDUCE
MEME
• Given a cluster of genes, find an overrepresented motif (PSSM)
– Learn the location and parameters of the PSSM
• X_ij: k-length window at position j in sequence i
• Z_ij: indicator (0 or 1) of whether the window is a motif
• λ: class probability
• (On board: picture of data and variables)

$$\theta = (\theta_0\ \ \theta_1) = \begin{pmatrix} p_{A,0} & p_{A,1} & \cdots & p_{A,k} \\ \vdots & \vdots & & \vdots \\ p_{T,0} & p_{T,1} & \cdots & p_{T,k} \end{pmatrix}$$

$$P(Z = 1) = \lambda$$
Expectation Maximization
• Chicken-and-egg problem:
– If we knew Z_ij, we could estimate the parameters θ and λ
– If we knew the parameters θ and λ, we could compute the probability that Z_ij = 1:

$$P(Z_{ij} = 1 \mid X_{ij}, \theta, \lambda)$$

• Idea:
– Replace Z_ij by its expectation
– Iteratively recompute estimates for Z_ij, then θ and λ
EM algorithm
• Initialize parameters θ and λ
• E-step:
– Compute expected values of the missing information, given the current parameters and the data

$$Z_{ij}^{(t)} \leftarrow P(Z_{ij} = 1 \mid X_{ij}, \theta^{(t)}, \lambda^{(t)}) = \frac{P(X_{ij} \mid \theta_1^{(t)})\, \lambda^{(t)}}{P(X_{ij} \mid \theta_0^{(t)})\,(1 - \lambda^{(t)}) + P(X_{ij} \mid \theta_1^{(t)})\, \lambda^{(t)}}$$
EM algorithm
• M-step:
– Estimate parameters using expected counts, based on the current Z_ij

$$\theta^{(t+1)}:\quad p_{x,l} = \frac{C_{x,l} + \beta_{x,l}}{\sum_x \left(C_{x,l} + \beta_{x,l}\right)}, \qquad C_{x,l} = \sum_i \sum_j Z_{ij}^{(t)}\, I_x(i,\ j + l - 1)$$

$$\lambda^{(t+1)} = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} Z_{ij}^{(t)}$$
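The E- and M-steps can be sketched for a simplified version of this model, treating each fixed-length window as either motif or background with pseudocounts β in the M-step; the window list, initialization, and helper names (`one_hot`, `em_step`) are my own invented illustration, not MEME itself:

```python
import numpy as np

BASES = "ACGT"

def one_hot(windows):
    """(n_windows, k, 4) indicator array over bases, one row per window."""
    X = np.zeros((len(windows), len(windows[0]), 4))
    for i, w in enumerate(windows):
        for l, b in enumerate(w):
            X[i, l, BASES.index(b)] = 1.0
    return X

def em_step(X, theta1, theta0, lam, beta=0.5):
    """One EM iteration for the two-class (motif vs. background) model."""
    # E-step: Z_i = P(window i is a motif | X, theta, lambda)
    p1 = lam * np.prod((theta1[None] ** X).reshape(len(X), -1), axis=1)
    p0 = (1 - lam) * np.prod((theta0[None, None] ** X).reshape(len(X), -1), axis=1)
    Z = p1 / (p1 + p0)
    # M-step: expected counts plus pseudocounts beta, then renormalize
    C1 = (Z[:, None, None] * X).sum(axis=0) + beta          # motif columns
    theta1 = C1 / C1.sum(axis=1, keepdims=True)
    C0 = ((1 - Z)[:, None, None] * X).sum(axis=(0, 1)) + beta  # background
    theta0 = C0 / C0.sum()
    lam = Z.mean()
    return theta1, theta0, lam, Z

# Invented windows: six identical "motif" windows plus background-like ones
rng = np.random.default_rng(0)
windows = ["ATTGT"] * 6 + ["GCACG", "CCGAG", "GACCG", "AGGCC"]
X = one_hot(windows)
theta1 = rng.dirichlet(np.ones(4), size=X.shape[1])  # random init breaks symmetry
theta0 = np.full(4, 0.25)
lam = 0.5
for _ in range(20):
    theta1, theta0, lam, Z = em_step(X, theta1, theta0, lam)
```

One detail worth noting: if θ1 and θ0 are both initialized uniform, the E-step gives Z = 1/2 everywhere and EM is stuck at a symmetric fixed point, which is why the motif model is initialized randomly here.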
Nucleosome positioning
• “Chromatin decouples promoter threshold from dynamic range”, Lam et al., Nature 2008
• Suggests that the affinity of a TF site in the nucleosome-free region determines the level of physiological stimulus required for activation
Cis-regulatory modules
• Homotypic + heterotypic clusters of TF binding sites → functional regulatory sequence
[Gupta & Liu, PNAS 2005]
Bayesian networks
• Learn conditional (in)dependencies between mRNA expression levels of genes
• (Draw example, joint distribution)
• Issues:
– Biological: mRNA as a proxy for activity
– Interpretation: edges ≠ causal direction
– Statistical and computational problems
Bayesian networks
• Graphical models
– Genes: random variables
– Graph encodes Markov independencies: “every variable is independent of its non-descendants, conditioned on its parents”

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))$$
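The factorization can be checked on a toy three-gene chain X1 → X2 → X3, with binary on/off states and invented probabilities:

```python
# Toy three-gene network X1 -> X2 -> X3 (binary on/off states); all numbers
# are invented. Per the factorization, the joint is
# P(X1, X2, X3) = P(X1) P(X2 | X1) P(X3 | X2).
P_x1 = {1: 0.3, 0: 0.7}
P_x2_given_x1 = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.2, 0: 0.8}}  # outer key: x1
P_x3_given_x2 = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.1, 0: 0.9}}  # outer key: x2

def joint(x1, x2, x3):
    return P_x1[x1] * P_x2_given_x1[x1][x2] * P_x3_given_x2[x2][x3]

# The factorized product defines a valid joint distribution (sums to 1),
# using 2 + 4 + 4 local parameters instead of 2^3 joint entries.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

The parameter saving is the point: with n genes and bounded parent sets, the factorization replaces an exponential joint table with a product of small local conditionals.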
Structure learning
• Optimize a Bayesian score (via heuristic search) over structures S:

$$\log P(S \mid D) = \underbrace{\log P(D \mid S)}_{\text{likelihood}} + \underbrace{\log P(S)}_{\text{structure prior}} + C$$

• Pros:
– Factorizes into local components
• Cons:
– Exponential search space
– Many local maxima
– Undersampled: 1000s of variables, only a few hundred joint observations
Bayesian networks: example
• Example: ~300-array compendium in yeast; one subnetwork similar to the mating response
[Network figure: genes SST2, KAR4, TEC1, SLT2, KSS1, YLR343W, YLR334C, STE6, FUS1, PRM1, AGA1, FIG1, FUS3, AGA2, TOM6, YEL059W (Pe'er et al., 2001)]
Bayesian networks: example
• Extra technical points
– Model knock-outs as “graph interventions” (modify counts): more causal edges
– Bootstrapping to estimate confident features (like edges)
– Statistical test to score subnetworks of edges
Some recent models
(Beer et al., 2004)
(Segal et al., 2003)