Modeling Motifs – Collecting Data
(Measuring and Modeling Specificity of Protein-DNA Interactions)
Computational Genomics Course
Cold Spring Harbor Labs
Oct 30, 2015
Gary D. Stormo
Department of Genetics
Outline
• Modeling specificity with a position weight matrix (PWM)
  – General features
  – Limitations and extensions
• How to set the weights
  – General ideas, some history
  – Using high-throughput experimental data
  – Using in vivo location data (ChIP-chip/seq)
Terminology: Sites vs Motifs
Think restriction sites
Enzyme   {Sites}                          Motif
EcoRI    {GAATTC}                         GAATTC
HincII   {GTTAAC,GTTGAC,GTCAAC,GTCGAC}    GTYRAC
Transcription factor motifs should be quantitative, give different scores to different sites, reflecting differences in binding affinity
Also: a site is a specific location in the genome
Representations/Models of Protein-DNA binding
• Transcription factors don’t bind to just one sequence
• A “Consensus sequence” is usually the preferred site, but similar sequences also bind well
• Not all variants bind equally well; some positions contribute more to the specificity than others
Position Weight Matrix Model (PWM, also PSSM)
A: -8 10 -1 2 1 -8
C: -10 -9 -3 -2 -1 -12
G: -7 -9 -1 -1 -4 -9
T: 10 -6 9 0 -1 11
PWM Model
….A C T A T A A T G T …
Score = -24 (window CTATAA)
PWM Model
….A C T A T A A T G T …
Score = 43 (window TATAAT)
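The two scores above come from sliding the matrix along the sequence one base at a time. A minimal sketch of that scan, using the matrix from the slide (the function names `score_window` and `scan` are illustrative, not from the lecture):

```python
# PWM from the slide: one row per base, one column per motif position.
PWM = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}


def score_window(window, pwm=PWM):
    """Sum the matrix entry selected by the base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(window))


def scan(sequence, pwm=PWM):
    """Score every window of motif width along the sequence."""
    width = len(pwm["A"])
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        results.append((start, window, score_window(window, pwm)))
    return results


for start, window, score in scan("ACTATAATGT"):
    print(start, window, score)
```

In this scan the window CTATAA scores -24 and the window TATAAT scores 43, the two values shown on the slides.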
PWM Model
PWM is a generalization of consensus sequence.
There is NO advantage in consensus sequences.
Given a consensus sequence one can define a PWM and a threshold that will return the same sites.
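One way to see the equivalence, sketched under the assumption of a 0/1 matrix built from IUPAC codes and a threshold equal to the motif length (the helper names are illustrative, not from the lecture):

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_pwm(consensus):
    """Weight 1 for every base the consensus symbol allows, 0 otherwise."""
    return [{b: (1 if b in IUPAC[sym] else 0) for b in "ACGT"}
            for sym in consensus]


def matching_sites(consensus):
    """All k-mers whose PWM score reaches the threshold (motif length)."""
    pwm = consensus_to_pwm(consensus)
    threshold = len(consensus)
    hits = []
    for kmer in product("ACGT", repeat=len(consensus)):
        score = sum(column[base] for column, base in zip(pwm, kmer))
        if score >= threshold:
            hits.append("".join(kmer))
    return hits


print(matching_sites("GTYRAC"))
```

With the HincII motif GTYRAC, this returns exactly the four sites in the HincII site set; a PWM with graded weights generalizes this by letting near-matches receive intermediate scores.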
PWM Model
$\mathrm{Score}(S_i) = W \cdot S_i$
PWM is a linear model:
• $S_i$ encodes the sequence (which base occurs at each position)
• W weights those encoded features to provide the score
• Easy to add more features if they are necessary
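The linear form can be made explicit by one-hot encoding the site and taking a dot product with the flattened matrix. A small sketch, assuming a position-major layout for both vectors (the layout and function names are choices made here, not from the lecture):

```python
BASES = "ACGT"

# Matrix from the slides, rows in BASES order, columns are motif positions.
W = [
    [-8, 10, -1, 2, 1, -8],     # A
    [-10, -9, -3, -2, -1, -12],  # C
    [-7, -9, -1, -1, -4, -9],    # G
    [10, -6, 9, 0, -1, 11],      # T
]


def one_hot(site):
    """Encode a site as a 4*L vector of 0/1 features, position by position."""
    return [1 if base == b else 0 for base in site for b in BASES]


def flatten(matrix):
    """Flatten the matrix in the same position-major order as one_hot."""
    length = len(matrix[0])
    return [matrix[BASES.index(b)][i] for i in range(length) for b in BASES]


def linear_score(site, matrix=W):
    """Score = W . S, a plain dot product of weights and encoded features."""
    return sum(w * s for w, s in zip(flatten(matrix), one_hot(site)))


print(linear_score("TATAAT"))
```

Adding features (for example dinucleotide indicators) only lengthens both vectors; the scoring step stays a dot product.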
Two important issues need to be addressed
• Parameter estimation: Where do the matrix elements come from? Different types of data lead to different methods of parameter estimation.
• Additivity: do the positions really contribute independently to the binding interaction? If not, how do we extend the model?
Complete binding energy list vs model.
[Figure: the complete list of binding energies, one value per sequence (AAA, AAC, AAG, AAT, … TTT), shown next to the matrix model (one value per base per position) for the same data]
If the simple additive model is inadequate, one can use di-nucleotide or
higher-order models. Some form of a matrix model must be correct
because the binding data itself is a matrix (vector).
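A quick way to check additivity when the complete energy list is available: for a complete, balanced list (every sequence of length L measured once), the least-squares additive model is given by the per-position marginal means, and the residuals show what the additive model misses. This is a sketch under that completeness assumption; the function names are illustrative.

```python
BASES = "ACGT"


def fit_additive(energies):
    """Fit grand mean + per-position base effects from a complete energy list.

    energies: dict mapping every sequence of length L to its energy.
    """
    length = len(next(iter(energies)))
    grand_mean = sum(energies.values()) / len(energies)
    effects = []
    for i in range(length):
        column = {}
        for b in BASES:
            values = [e for s, e in energies.items() if s[i] == b]
            column[b] = sum(values) / len(values) - grand_mean
        effects.append(column)
    return grand_mean, effects


def predict(seq, grand_mean, effects):
    """Additive prediction: grand mean plus one effect per position."""
    return grand_mean + sum(effects[i][b] for i, b in enumerate(seq))


def max_residual(energies):
    """Largest absolute deviation of the data from the additive fit."""
    grand_mean, effects = fit_additive(energies)
    return max(abs(e - predict(s, grand_mean, effects))
               for s, e in energies.items())
```

A residual near zero says the PWM is adequate; large residuals concentrated on particular base pairs point to the dinucleotide terms worth adding.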
Alternative approach to higher-order contributions: structure parameters
Maybe the non-additivity is due to structural preferences known to be dependent on nearest-neighbor bases (or longer)
May capture context effects with fewer parameters
For example, see work by Rohs and colleagues:
Covariation between homeodomain transcription factors and the shape of their DNA binding sites. Nucleic Acids Res. 2014 42:430-41.
TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res. 2014 42(Database issue):D148-55.
Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci U S A. 2015 112(15):4654-9.
How to Set the Matrix Elements
• Statistical treatment of known sites. Need a reasonable sample size. Some assumptions about how the sample is obtained.
  - probabilistic model is easy, can be accurate if assumptions are reasonable
• Quantitative binding data: determine matrix parameters that provide the best fit.
  - Has been laborious and slow experimental work, but new technologies make this much easier
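Modeling based on known sites
N(b,i): count of base b at position i in the aligned sites
F(b,i): frequency of base b at position i
W(b,i) = log[F(b,i)/P(b)]: log-odds PWM, with P(b) the background frequency of base b
I(i) = ∑_b F(b,i)W(b,i): information content at position i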
Classic Logo (from Tom Schneider): Height of column at each position is Information Content
Each base in proportion to its frequency
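The chain from counts to frequencies to log-odds weights to information content can be sketched as follows. Assumptions made here and not stated on the slide: base-2 logarithms (so information is in bits), a uniform background unless one is supplied, and a pseudocount of 1 per base to avoid log of zero.

```python
import math

BASES = "ACGT"


def log_odds_pwm(sites, background=None, pseudocount=1.0):
    """Return (W, I): log-odds weights and information content per position."""
    background = background or {b: 0.25 for b in BASES}
    length = len(sites[0])
    weights, information = [], []
    for i in range(length):
        # N(b,i): counts at this position, plus the pseudocount
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        # F(b,i): frequencies
        freqs = {b: counts[b] / total for b in BASES}
        # W(b,i) = log[F(b,i)/P(b)]
        w = {b: math.log2(freqs[b] / background[b]) for b in BASES}
        weights.append(w)
        # I(i) = sum over b of F(b,i) * W(b,i)
        information.append(sum(freqs[b] * w[b] for b in BASES))
    return weights, information


W, I = log_odds_pwm(["TATAAT", "TATGAT", "TACAAT", "TATACT"])
for position, bits in enumerate(I, start=1):
    print(position, round(bits, 3))
```

The per-position values in `I` are the column heights of the logo; within a column each base's letter height is its frequency times that value.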
Modeling from experimental data
• From single binding site experiments to high-throughput methods that allow for the determination of specificity (relative affinity) across all possible sequences at once
Quantitative binding affinity of TF for one sequence
Specificity refers to the relative affinity to different sequences, ideally to all sequences
Specificity Modeling
High-throughput experimental methods to measure TF specificity
High-throughput in vitro binding site analyses
• Can give good, quantitative models of intrinsic binding specificity
• More data alone isn’t sufficient to give better models, also need good analysis methods
• Log-odds method is based on assumptions that may not be true
• Energetic models can give better descriptions
  – Non-linear relationship between binding affinity and binding probability at high TF concentration
Log-odds method is equivalent to an energy model if the sites are from a Boltzmann distribution with binding probability ∝ $e^{-E}$
$F(S_i) = P(S_i)\, e^{-E_i} / Z$   (F: posterior, P: prior)
$E_i = -\ln\dfrac{F(S_i)}{P(S_i)} - \ln Z$
Log-odds relationship between binding energy and frequencies
Reality is a Fermi-Dirac distribution with Boltzmann a special case at the low concentration range
Djordjevic et al, Genome Res. 2003 13:2381-90.
Additive changes in binding energy can have non-independent (context dependent) effects on binding probability
Probabilities no longer factor, even though energies are additive
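A small numerical illustration of that point, assuming energies in units of kT and the Fermi-Dirac form 1/(1 + e^(E − μ)) for binding probability (μ stands in for the TF concentration term; the specific numbers are made up for the example):

```python
import math


def p_bound(energy, mu):
    """Fermi-Dirac binding probability for a site with the given energy."""
    return 1.0 / (1.0 + math.exp(energy - mu))


def relative_probability(delta_e, mu):
    """Binding probability of a variant relative to the E = 0 reference."""
    return p_bound(delta_e, mu) / p_bound(0.0, mu)


# Two substitutions, each adding 2 kT; energies are additive, so the
# double variant sits at 4 kT above the reference.
for mu in (-10.0, 5.0):  # low vs high TF concentration
    single = relative_probability(2.0, mu)
    double = relative_probability(4.0, mu)
    print(mu, round(single, 4), round(double, 4), round(single * single, 4))
```

At low concentration the double variant's relative probability is close to the product of the two single-variant values (the Boltzmann regime). At high concentration the best sites saturate, and the product no longer matches the double variant even though the energies were strictly additive.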
HT-SELEX (SELEX-Seq)
$\min_{a,b,W,\mu} \sum_i \left[ N(S_i) - n(S_i \mid \text{data}) \right]^2$
$N(S_i) = \dfrac{a\,P(S_i)}{1 + e^{W \cdot S_i - \mu}} + b$
Parameters to fit: a, b, W, μ
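A sketch of evaluating a least-squares objective of this kind. The functional form used here is an assumption: predicted count = a · prior / (1 + e^(W·S − μ)) + b, chosen to be consistent with the four fitted parameters a, b, W, μ and with the Fermi-Dirac binding probability used elsewhere in these slides. It is not taken from the BEEML source code, and the optimizer itself is omitted.

```python
import math


def energy(site, W):
    """Additive energy W . S; W is a list of {base: energy} dicts per position."""
    return sum(W[i][base] for i, base in enumerate(site))


def predicted_count(site, prior, a, b, W, mu):
    """Assumed model: scaled prior times binding probability, plus background b."""
    return a * prior / (1.0 + math.exp(energy(site, W) - mu)) + b


def sum_squared_error(observed, priors, a, b, W, mu):
    """Objective to minimize over a, b, W and mu.

    observed: {site: selected-read count}; priors: {site: prior frequency}.
    """
    return sum(
        (predicted_count(site, priors[site], a, b, W, mu) - count) ** 2
        for site, count in observed.items()
    )
```

In practice this function would be handed to a general non-linear least-squares routine; the point of the sketch is only that W enters through the energy, so the fit is to binding probability rather than to log-odds of counts.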
Fit of model to HT-SELEX data for zif268
BEEML vs BioProspector
Zhao et al, PLoS Comp Bio, 2009
Protein Binding Microarray (PBM)
Example of Plag1 using BEEML-PBM
Nat Biotechnol. 2013 31:126-34.
• Most TFs (~90%) fit well by PWMs
• BEEML-PBM among the best methods
• Some do better with di-nuc models
• A few require multiple modes of interaction
• Best models fit in vivo data as well as in vivo-derived models
Zhao and Stormo, Nature Biotechnol. 2011 29:480-483
Bacterial-1-Hybrid (B1H)
B1H on zif268 returns the expected model
Average Prediction Accuracy for ZFPs
http://stormo.wustl.edu/ZFModels/
HT-SELEX (SELEX-Seq)
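$\dfrac{P(S_i \mid b)}{P(S_i)} \propto \dfrac{1}{1 + e^{E_i - \mu}}$
Compared to reference sequence with E = 0
$\dfrac{P(S_i \mid b) / P(S_i)}{P(S_{ref} \mid b) / P(S_{ref})} = \dfrac{1 + e^{-\mu}}{1 + e^{E_i - \mu}}$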
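The HT-SELEX enrichment relative to the reference depends on μ as well as on the energy, so recovering an energy from an observed ratio requires a value for μ. A sketch of the forward relation and its algebraic inverse (function names are illustrative; energies in kT):

```python
import math


def relative_enrichment(energy, mu):
    """Enrichment of a site relative to the E = 0 reference sequence."""
    return (1.0 + math.exp(-mu)) / (1.0 + math.exp(energy - mu))


def energy_from_enrichment(ratio, mu):
    """Invert the relation above: E = mu + ln((1 + e^-mu) / ratio - 1)."""
    return mu + math.log((1.0 + math.exp(-mu)) / ratio - 1.0)


# The same 2 kT site looks less disfavoured as mu (TF concentration) rises.
for mu in (-5.0, 0.0, 3.0):
    print(mu, round(relative_enrichment(2.0, mu), 4))
```

This is why μ has to be fitted together with W for HT-SELEX data rather than reading energies directly off the log enrichment.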
Spec-seq (specificity by sequencing)
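$\dfrac{P(S_i \mid b)}{P(S_i \mid u)} = e^{\mu - E_i}$
Compared to reference sequence with E = 0
$\dfrac{P(S_{ref} \mid b) / P(S_{ref} \mid u)}{P(S_i \mid b) / P(S_i \mid u)} = e^{E_i} \;\rightarrow\; \ln \dfrac{P(S_{ref} \mid b) / P(S_{ref} \mid u)}{P(S_i \mid b) / P(S_i \mid u)} = E_i$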
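$P + S_i \leftrightarrow P \cdot S_i$
$K_A(S_i) = \dfrac{[P \cdot S_i]}{[P]\,[S_i]}$
$K_A(S_1) : K_A(S_2) : \dots : K_A(S_n) = \dfrac{[P \cdot S_1]}{[S_1]} : \dfrac{[P \cdot S_2]}{[S_2]} : \dots : \dfrac{[P \cdot S_n]}{[S_n]}$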
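Because the free protein concentration [P] is shared by all sequences in the same reaction, it cancels in the ratios, and relative energies follow directly from bound and unbound read counts. A sketch, assuming read counts are proportional to concentrations and energies are reported in kT relative to a chosen reference (the sequences and counts below are made up for illustration):

```python
import math


def specseq_energies(bound, unbound, reference):
    """Relative binding energies from bound/unbound read counts.

    bound, unbound: {sequence: read count} for the two fractions.
    Returns {sequence: E_i}, with the reference sequence at E = 0.
    """
    ref_ratio = bound[reference] / unbound[reference]
    return {
        seq: math.log(ref_ratio / (bound[seq] / unbound[seq]))
        for seq in bound
    }


# Hypothetical counts for a reference site and two variants.
bound = {"AATTGTGAGC": 1000, "AATTGTTAGC": 100, "AATTGCGAGC": 400}
unbound = {"AATTGTGAGC": 500, "AATTGTTAGC": 500, "AATTGCGAGC": 500}
for seq, e in specseq_energies(bound, unbound, "AATTGTGAGC").items():
    print(seq, round(e, 2))
```

Normalizing counts to probabilities first would give the same energies, since the per-fraction totals cancel in the ratio of ratios.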
Spec-seq: Specificity by sequencing
Specificity of the Lac repressor
WT operator is asymmetric
4 libraries: vary both sequence and spacing
2560 different binding sites
Highly reproducible:
~5% variance in affinity
~0.1 kT variance in energy
Zuo and Stormo, Genetics, 2014
Three‐dimensional structure of the dimeric lac HP62–O1 operator complex.
Kalodimos C G et al. EMBO J. 2002;21:2866-2876
©2002 by European Molecular Biology Organization
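$K_{X|Y}(x_1) : K_{X|Y}(x_2) : \dots : K_{X|Y}(x_n) = \dfrac{N(x_1 \mid B_{X,Y})}{N(x_1 \mid B_{-,Y})} : \dfrac{N(x_2 \mid B_{X,Y})}{N(x_2 \mid B_{-,Y})} : \dots : \dfrac{N(x_n \mid B_{X,Y})}{N(x_n \mid B_{-,Y})}$
$\omega_i = \dfrac{K_{X|Y}(S_i)}{K_X(S_i)} = \dfrac{K_{Y|X}(S_i)}{K_Y(S_i)} = \dfrac{K_{X,Y}(S_i)}{K_X(S_i)\,K_Y(S_i)}$
$K_X(x_1) : K_X(x_2) : \dots : K_X(x_n) = \dfrac{N(x_1 \mid B_{X,-})}{N(x_1 \mid B_{-,-})} : \dfrac{N(x_2 \mid B_{X,-})}{N(x_2 \mid B_{-,-})} : \dots : \dfrac{N(x_n \mid B_{X,-})}{N(x_n \mid B_{-,-})}$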
Spec-seq for combinatorial binding can get all of the important parameters in one experiment, including cooperativity
Stormo, Zuo, Chang, Briefings in Functional Genomics, 2015
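A sketch of how the band counts combine into a cooperativity estimate. Assumptions made here: reads are counted per sequence in four gel bands (both factors bound, only Y bound, only X bound, unbound), the band labels "XY", "-Y", "X-", "--" are hypothetical keys, and because each count ratio gives affinity only up to a shared constant, ω is reported relative to a reference sequence.

```python
def relative_affinities(numerator_band, denominator_band):
    """Per-sequence count ratio, proportional to the association constant."""
    return {seq: numerator_band[seq] / denominator_band[seq]
            for seq in numerator_band}


def relative_cooperativity(counts, reference):
    """omega_i / omega_ref from read counts in the four bands.

    counts: {"XY": {...}, "-Y": {...}, "X-": {...}, "--": {...}},
    each inner dict mapping sequence -> read count.
    """
    k_x = relative_affinities(counts["X-"], counts["--"])          # K_X
    k_x_given_y = relative_affinities(counts["XY"], counts["-Y"])  # K_X|Y
    omega = {seq: k_x_given_y[seq] / k_x[seq] for seq in k_x}
    return {seq: omega[seq] / omega[reference] for seq in omega}
```

A value above 1 for a sequence means Y helps X bind that sequence more than it does the reference; the same calculation with X and Y swapped should agree, which is a useful consistency check on the data.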
Specificity Modeling
Conclusions:
1. Different types of high-throughput data can be used to obtain good specificity models; good analysis methods are critical
2. PWMs are often (usually?) good approximations, but higher-order models can be obtained if needed
Discovery of Binding Motifs from in vivo data
Datatypes for Motif Discovery
• Co-regulated genes
  – Genetic studies (deletion, over-expression effects)
  – Expression analysis (microarrays, RNA-Seq)
• Co-bound regions
  – ChIP-chip/-Seq location analysis
• Phylogenetic analysis, conservation across species
  – “phylogenetic footprinting”
  – Can be combined with multigene analysis, even over the whole genome
Goal: Find the “most significant” pattern in common
• Can’t look at all possible alignments – too many
• In vitro analysis methods don’t work; assumptions not valid
Outline of problem
CE1CG\TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG\
ECOARABOP\GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG\
ECOBGLR1\ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT\
ECOCRP\CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC\
ECOCYA\ACGGTGCTACACTTGTATGTAGCGCATCTTTCTTTACGGTCAATCAGCAAGGTGTTAAATTGATCACGTTTTAGACCATTTTTTCGTCGTGAAACTAAAAAAACC\
ECODEOP2\AGTGAATTATTTGAACCAGATCGCATTACAGTGATGCAAACTTGTAAGTAGATTTCCTTAATTGTGATGTGTATCGAAGTGTGTTGCGGAGTAGATGTTAGAATA\
ECOGALE\GCGCATAAAAAACGGCTAAATTCTTGTGTAAACGATTCCACTAATTTATTCCATGTCACACTTTTCGCATCTTTGTTATGCTATGGTTATTTCATACCATAAGCC\
ECOILVBPR\GCTCCGGCGGGGTTTTTTGTTATCTGCAATTCAGTACAAAACGTGATCAACCCCTCAATTTTCCCTTTGCTGAAAAATTTTCCATTGTCTCCCCTGTAAAGCTGT\
ECOLAC\AACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCAC\
ECOMALBA\ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGCGTAGGGGCAAGGAGGATGGAAAGAGGTTGCCGTATAAAGAAACTAGAGTCCGTTTA\
ECOMALBA\GGAGGAGGCGGGAGGATGAGAACACGGCTTCTGTGAACTAAACCGAGGTCATGTAAGGAATTTCGTGATGTTGCTTGCAAAAATCGTGGCGATTTTATGTGCGCA\
ECOMALT\GATCAGCGTCGTTTTAGGTGAGTTGTTAATAAAGATTTGGAATTGTGACACAGTGCAAATTCAGACACATAAAAAAACGTCATCGCTTGCATTAGAAAGGTTTCT\
ECOOMPA\GCTGACAAAAAAGATTAAACATACCTTATACAAGACTTTTTTTTCATATGCCTGACGGAGTTCACACTTGTAAGTTTTCAACTACGTTGTAGACTTTACATCGCC\
ECOTNAA\TTTTTTAAACATTAAAATTCTTACGTAATTTATAATCTTTAAAAAAAGCATTTAATATTGCTCCCCGAACGATTGTGATTCGATTCACATTTAAACAATTTCAGA\
ECOUXU1\CCCATGAGAGTGAAATTGTTGTGATGTGGTTAACCCAATTAGAATTCGGGATTGACATGTCTTACCAAAAGGTAGAACTTATACGCCATCTCATCCGATGCAAGC\
PBR322\CTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCTC\
TRN9CAT\CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCTGGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTTTTGGCGAAAATGAGACGTTGATCGGCACG\
TDC\GATTTTTATACTTTAACTTGTTGATATTTAAAGGTATTTAATTGTAATAACGATACTCTGGAAAGTATTGAAAGTTAATTTGTGAGTGGTCGCACATATCCTGTT\
Example dataset: promoter region from co-regulated genes
Expectation Maximization (EM) Approach to Motif Discovery
Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Iterate between those steps until convergence
Algorithm:
• Initial PWM from average of all possible sites
• Using current PWM estimate probability of each position being site; make new PWM from weighted average of all sites
• Iterate to convergence; usually fast, no guarantee of optimal
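The steps above can be sketched as a short EM loop. Assumptions made for this sketch: exactly one site per sequence, a uniform background of 0.25 per base, a small pseudocount in the PWM update, and a fixed number of iterations instead of a convergence test. It is an illustration of the idea, not the implementation of any particular published program.

```python
BASES = "ACGT"


def weighted_pwm(seqs, weights, width, pseudo=0.5):
    """M-step: frequency matrix from all windows, weighted by site probability."""
    counts = [[pseudo] * 4 for _ in range(width)]
    for seq, row in zip(seqs, weights):
        for start, w in enumerate(row):
            for j in range(width):
                counts[j][BASES.index(seq[start + j])] += w
    return [[c / sum(col) for c in col] for col in counts]


def site_posteriors(seqs, pwm, width, background=0.25):
    """E-step: probability that each start position is the site, per sequence."""
    posteriors = []
    for seq in seqs:
        odds = []
        for start in range(len(seq) - width + 1):
            ratio = 1.0
            for j in range(width):
                ratio *= pwm[j][BASES.index(seq[start + j])] / background
            odds.append(ratio)
        total = sum(odds)
        posteriors.append([o / total for o in odds])
    return posteriors


def em_motif(seqs, width, iterations=50):
    """Start from the average of all possible sites, then alternate E and M."""
    weights = [[1.0 / (len(s) - width + 1)] * (len(s) - width + 1) for s in seqs]
    pwm = weighted_pwm(seqs, weights, width)
    for _ in range(iterations):
        weights = site_posteriors(seqs, pwm, width)
        pwm = weighted_pwm(seqs, weights, width)
    return pwm, weights
```

Calling `em_motif(sequences, width)` returns the frequency matrix and, for each sequence, the probability of each start position; as the slide notes, the result can be a local optimum, so different starting points may be worth trying.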
Gibbs’ Sampling Approach to Motif Discovery
Same Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Iterate between those steps until convergence
Algorithm:
• Pick 1 site from N-1 sequences, make PWM
• Use “pseudocounts” to avoid prob. = 0
• Use current PWM to pick site from left out sequence by sampling from probability distribution; update PWM
• Iterate to convergence; run multiple times, compare results; still no guarantee of optimal but avoids local optima often obtained with EM
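A compact sketch of the sampler described above. Assumptions made here: one site per sequence, a uniform background of 0.25, a pseudocount of 1 per base, random initial positions, and a fixed number of sweeps in place of a convergence criterion. It illustrates the leave-one-out update and is not the code from Lawrence et al.

```python
import random

BASES = "ACGT"


def pwm_from_sites(sites, width, pseudo=1.0):
    """Frequency matrix from the current sites, with pseudocounts."""
    counts = [[pseudo] * 4 for _ in range(width)]
    for site in sites:
        for j, base in enumerate(site):
            counts[j][BASES.index(base)] += 1
    return [[c / sum(col) for c in col] for col in counts]


def gibbs_motif(seqs, width, sweeps=100, seed=0):
    """Return one sampled start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Build the PWM from the sites in the other N-1 sequences.
            others = [s[p:p + width]
                      for i, (s, p) in enumerate(zip(seqs, starts)) if i != k]
            pwm = pwm_from_sites(others, width)
            # Likelihood ratio of every window in the left-out sequence.
            odds = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    ratio *= pwm[j][BASES.index(seq[start + j])] / 0.25
                odds.append(ratio)
            # Sample the new site in proportion to those ratios.
            starts[k] = rng.choices(range(len(odds)), weights=odds)[0]
    return starts
```

Running it with several different `seed` values and comparing the resulting alignments follows the slide's advice to run multiple times and compare results.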
From Lawrence et al. (1993) Science 262:208-14.
Motif discovery from co-regulated genes
Single species Multiple species
Example – Leu3
Alignment of profiles
YGL125W
A . . 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . .
C . . 0 1 1 2 4 2 4 4 4 0 0 0 0 0 4 0 0 0 4 4 0 0 4 0 0 0 0 . .
G . . 1 0 0 0 0 0 0 0 0 4 4 0 0 4 0 4 4 4 0 0 3 0 0 4 0 0 0 . .
T . . 3 3 3 2 0 2 0 0 0 0 0 4 0 0 0 0 0 0 0 0 1 4 0 0 4 4 4 . .
YOR108W
A . . 0 0 4 2 0 0 0 0 0 0 0 0 4 4 0 0 0 1 0 0 0 3 1 0 0 3 4 . .
C . . 0 2 0 1 0 0 0 4 4 0 0 0 0 0 4 0 0 1 4 1 0 1 0 0 1 0 0 . .
G . . 0 0 0 0 4 4 0 0 0 4 4 0 0 0 0 4 4 0 0 2 0 0 3 1 3 1 0 . .
T . . 4 2 0 1 0 0 4 0 0 0 0 4 0 0 0 0 0 2 0 1 4 0 0 3 0 0 0 . .
YMR108W
A . . 0 2 1 1 0 1 0 0 0 0 0 0 4 0 0 0 0 0 0 0 1 2 3 0 0 1 1 . .
C . . 3 0 1 1 4 0 0 4 4 0 0 0 0 4 4 0 0 4 0 0 0 0 1 1 0 3 2 . .
G . . 1 1 2 0 0 0 4 0 0 4 4 0 0 0 0 4 4 0 0 0 3 2 0 0 1 0 1 . .
T . . 0 1 0 2 0 3 0 0 0 0 0 4 0 0 0 0 0 0 4 4 0 0 0 3 3 0 0 . .
YGL125W
S. cerevisiae    GAAAAAATAACAGCGACTTTTCTCCCGGTAGCGGGCCGTCGTTTAGTCATTCTATCCCTC
S. mikatae       AAAACATAACAGCGAATTTTCCTCCCGGTAGCGGGCCTTCGTTTAGTCATTCTCTCTCTT
S. bayanus       AAAAAATAACAGCGACTTTTCCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCCC
S. kudriavzevii  GAAAAAAAACAACGGCGGCCTCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCTC
***** **** ** *** * *************************************
YOR108W
S. cerevisiae    GCCATCATGGTCCGGTAACGGTCGTAGTGAATGACTCATATTTTTCCATCTCTTT
S. mikatae       GCCATCAAGGTCCGGTAACGGTCGTAGTGAATGACTCACATTTTCTTCGTTATTC
S. bayanus       ACCATTACGGTCCGGTAACGGACTTAGTGAATGATTCATCTTTTCTTCTTTTTTC
S. kudriavzevii  GTCGTTAAGGTCCGGTAACGGCCCTCAGCGAATGATTCATAATTTCATTTTTTTC
***** * ************* * ********** *** **** *** ***
YMR108W
S. cerevisiae    AACGCCTAGCCGCCGGAGCCTGCCGGTACCGGCTTGGCTTCAGTTGCTGATCTCGG
S. mikatae       CACAATGACACATACCTAACAGCCGGTACCGGCTTGAATGCCGCCGTTGGCTTCGG
S. bayanus       ATCTTCTAGTCACCGCAGTCTGCCGGTACCGGCTTGAATTCCGCCGTTGATCCTGG
S. kudriavzevii  CACATCTCTAGTCCGCGCTCTGCCGGTACCGGCTTAGACTAGCCACGAATCTCGGC
** *** * **** ***************** **** ** * ** **
Alignment of conserved regions
Wang and Stormo, Bioinformatics 2003
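A minimal sketch of the conservation step in phylogenetic footprinting: find the longest run of columns that are identical across all species. It assumes the sequences are already aligned, ungapped and of equal length, which is a simplification; it does not reproduce the profile-alignment method of the cited paper.

```python
def conserved_columns(aligned):
    """True for each alignment column where every sequence has the same base."""
    return [len(set(column)) == 1 for column in zip(*aligned)]


def longest_conserved_block(aligned):
    """Return (start, block) for the longest run of fully conserved columns."""
    best_start, best_len, run_start = 0, 0, None
    # A trailing False closes any run that reaches the end of the alignment.
    for i, same in enumerate(conserved_columns(aligned) + [False]):
        if same and run_start is None:
            run_start = i
        elif not same and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, aligned[0][best_start:best_start + best_len]


# YMR108W upstream region in the four yeast species shown on the slide.
ymr108w = [
    "AACGCCTAGCCGCCGGAGCCTGCCGGTACCGGCTTGGCTTCAGTTGCTGATCTCGG",
    "CACAATGACACATACCTAACAGCCGGTACCGGCTTGAATGCCGCCGTTGGCTTCGG",
    "ATCTTCTAGTCACCGCAGTCTGCCGGTACCGGCTTGAATTCCGCCGTTGATCCTGG",
    "CACATCTCTAGTCCGCGCTCTGCCGGTACCGGCTTAGACTAGCCACGAATCTCGGC",
]
print(longest_conserved_block(ymr108w))
```

Conserved blocks found this way in several co-regulated genes can then be compared with each other, which is the multigene step the slides describe.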
Even whole-genome search for conserved, multi-copy elements (e.g., PhyloNet)
Wang and Stormo, PNAS 2005