
Modeling Motifs – Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions)

Computational Genomics Course
Cold Spring Harbor Labs
Oct 30, 2015

Gary D. Stormo
Department of Genetics

Outline

• Modeling specificity with a position weight matrix (PWM)
– General features
– Limitations and extensions

• How to set the weights
– General ideas, some history
– Using high-throughput experimental data
– Using in vivo location data (ChIP-chip/seq)

Terminology: Sites vs Motifs

Think restriction sites

          {Sites}                              Motif
EcoRI:    {GAATTC}                             GAATTC
HincII:   {GTTAAC, GTTGAC, GTCAAC, GTCGAC}     GTYRAC

Transcription factor motifs should be quantitative: they give different scores to different sites, reflecting differences in binding affinity

Also: a site is a specific location in the genome

Representations/Models of Protein-DNA binding

• Transcription factors don’t bind to just one sequence

• A “consensus sequence” is usually the preferred site, but similar sequences also bind well

• Not all variants bind equally well; some positions contribute more to the specificity than others

Position Weight Matrix Model (PWM, also PSSM)

A: -8 10 -1 2 1 -8

C: -10 -9 -3 -2 -1 -12

G: -7 -9 -1 -1 -4 -9

T: 10 -6 9 0 -1 11

PWM Model

….A C T A T A A T G T …

Score = -24 (matrix aligned to the window C T A T A A)

PWM Model

….A C T A T A A T G T …

Score = 43 (matrix aligned to the window T A T A A T)


PWM Model

PWM is a generalization of consensus sequence.
There is NO advantage in consensus sequences.
Given a consensus sequence one can define a PWM and a threshold that will return the same sites.
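The claim that any consensus sequence can be replaced by a PWM plus a threshold can be illustrated with the HincII motif GTYRAC. This is a minimal sketch, assuming a 0/1 weight scheme and a threshold equal to the motif length; the helper names `consensus_to_pwm` and `matches` are made up for illustration, and the last two test sites are arbitrary non-matching examples.

```python
# IUPAC codes needed for the HincII consensus GTYRAC (Y = C or T, R = A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "R": "AG"}

def consensus_to_pwm(consensus):
    """Assumed scheme: weight 1 for each base the consensus allows, 0 otherwise."""
    return [{b: (1 if b in IUPAC[code] else 0) for b in "ACGT"} for code in consensus]

def matches(site, pwm):
    """A site is returned iff its score reaches the threshold (the motif length)."""
    threshold = len(pwm)
    return sum(col[b] for col, b in zip(pwm, site)) >= threshold

pwm = consensus_to_pwm("GTYRAC")
candidates = ["GTTAAC", "GTTGAC", "GTCAAC", "GTCGAC", "GTAAAC", "GAATTC"]
hits = [s for s in candidates if matches(s, pwm)]
print(hits)
```

Lowering the threshold or replacing the 0/1 weights with graded values is what turns this into a quantitative model, which a consensus string cannot express.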

PWM Model

$$\mathrm{Score}(S_i) = W \cdot S_i$$

PWM is a linear model:
• S_i encodes the sequence (which base occurs at each position)
• W weights those encoded features to provide the score
• Easy to add more features if they are necessary
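The linear-model view can be sketched in a few lines: encode each window as 0/1 indicators and take the dot product with the weights. The matrix and the example sequence are the ones on the slides; the function names are illustrative only.

```python
# PWM from the slides: rows are bases, columns are the six motif positions.
PWM = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}
BASES = "ACGT"

def one_hot(window):
    """Encode a window as S: one 0/1 indicator per (base, position)."""
    return {b: [1 if c == b else 0 for c in window] for b in BASES}

def score(window, pwm=PWM):
    """Score = W . S: the sum of the weights selected by the encoded sequence."""
    s = one_hot(window)
    return sum(pwm[b][i] * s[b][i] for b in BASES for i in range(len(window)))

def scan(seq, pwm=PWM):
    """Score every window of motif length along the sequence."""
    width = len(pwm["A"])
    return [score(seq[i:i + width], pwm) for i in range(len(seq) - width + 1)]

scores = scan("ACTATAATGT")
print(scores)  # [-15, -24, 43, -23, 26]
```

The second and third windows give the scores of -24 and 43 quoted on the slides. Adding features (for example dinucleotide indicators) only means extending the encoding and the weight vector; the scoring rule stays a dot product.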

Two important issues need to be addressed

• Parameter estimation: Where do the matrix elements come from? Different types of data lead to different methods of parameter estimation.

• Additivity: do the positions really contribute independently to the binding interaction? If not, how do we extend the model?

Complete binding energy list vs model.

        1      2      3
AA    0.83   0.42   1.38
AC    0     -0.42   1.38
AG   -1.21   0.42   1.38
AT   -0.13   0     -0.79
↓      ↓      ↓      ↓
TT    1.25   0     -0.79

↓

         1      2
AAA    1.25   1.38
AAC    0.41   1.38
AAG    1.25   1.38
AAT    0.83  -0.79
↓       ↓      ↓
TTT    1.25  -0.79

If the simple additive model is inadequate, one can use dinucleotide or higher-order models. Some form of a matrix model must be correct because the binding data itself is a matrix (vector).

Alternative approach to higher-order contributions: structure parameters

Maybe the non-additivity is due to structural preferences known to be dependent on nearest neighbor bases (or longer)

May capture context effects with fewer parameters

For example, see work by Rohs and colleagues:

Covariation between homeodomain transcription factors and the shape of their DNA binding sites. Nucleic Acids Res. 2014 42:430-41.
TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res. 2014 42(Database issue):D148-55.
Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci U S A. 2015 112(15):4654-9.

How to Set the Matrix Elements

• Statistical treatment of known sites. Need a reasonable sample size. Some assumptions about how the sample is obtained.
- Probabilistic model is easy, can be accurate if assumptions are reasonable

• Quantitative binding data: determine matrix parameters that provide the best fit.
- Has been laborious and slow experimental work, but new technologies make this much easier

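Modeling based on known sites

N(b,i): number of known sites with base b at position i
F(b,i): frequency of base b at position i
W(b,i) = log[F(b,i)/P(b)]: log-odds PWM, where P(b) is the background frequency of base b
I(i) = ∑_b F(b,i) W(b,i): information content of position i

Classic Logo (from Tom Schneider): the height of the column at each position is its information content; each base is drawn in proportion to its frequency.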
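The counts-to-frequencies-to-log-odds pipeline can be sketched as follows. This is a minimal illustration under stated assumptions: base-2 logarithms (so information is in bits), a uniform background P(b) = 0.25, a small pseudocount, and a made-up list of six aligned sites; none of these choices come from the slides.

```python
import math

BASES = "ACGT"

def log_odds_pwm(sites, background=None, pseudocount=1.0):
    """Return (W, I): log-odds weights per position and information content per position."""
    background = background or {b: 0.25 for b in BASES}
    pwm, info = [], []
    for i in range(len(sites[0])):
        # N(b,i): observed counts plus a background-weighted pseudocount (assumption).
        counts = {b: pseudocount * background[b] for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        freqs = {b: counts[b] / total for b in BASES}                       # F(b,i)
        weights = {b: math.log2(freqs[b] / background[b]) for b in BASES}   # W(b,i)
        pwm.append(weights)
        info.append(sum(freqs[b] * weights[b] for b in BASES))              # I(i)
    return pwm, info

# Hypothetical aligned sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT", "TATAAT"]
pwm, info = log_odds_pwm(sites)
for i, bits in enumerate(info, start=1):
    print(f"position {i}: {bits:.2f} bits")
```

Positions where every site has the same base come out with the highest information content, which is what makes those columns tallest in a logo.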

Modeling from experimental data

• From single binding site experiments to high-throughput methods that allow for the determination of specificity (relative affinity) across all possible sequences at once

Quantitative binding affinity of a TF for one sequence

Specificity refers to the relative affinity to different sequences, ideally to all sequences

Specificity modeling

High-throughput experimental methods to measure TF specificity

High-throughput in vitro binding site analyses

• Can give good, quantitative models of intrinsic binding specificity

• More data alone isn’t sufficient to give better models, also need good analysis methods

• Log-odds method is based on assumptions that may not be true

• Energetic models can give better descriptions
– Non-linear relationship between binding affinity and binding probability at high TF concentration

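Log-odds method is equivalent to an energy model if the sites are from a Boltzmann distribution with binding probability ∝ e^{-E}:

$$F(S_i) = \frac{P(S_i)\, e^{-E_i}}{Z}, \qquad E_i = -\ln\frac{F(S_i)}{P(S_i)} - \ln Z$$

Here F(S_i) is the posterior (frequency of S_i among bound sites), P(S_i) is the prior (background frequency of S_i), and E_i is the binding energy. ln Z is the same for every sequence, so it only shifts all energies by a constant.

Log-odds relationship between binding energy and frequencies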

Reality is a Fermi-Dirac distribution with Boltzmann a special case at the low concentration range

Djordjevic et al, Genome Res. 2003 13:2381-90.

Additive changes in binding energy can have non-independent (context dependent) effects on binding probability

Probabilities no longer factor, even though energies are additive
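A small numerical sketch of this point: with the Fermi-Dirac form P(bound) = 1/(1 + e^{E−μ}), two additive energy penalties give probability ratios that multiply at low concentration (the Boltzmann regime) but not near saturation. The energy penalties and the two values of μ are hypothetical numbers chosen only for illustration.

```python
import math

def p_bound(energy, mu):
    """Fermi-Dirac occupancy of a site with binding energy `energy` (kT units)."""
    return 1.0 / (1.0 + math.exp(energy - mu))

# Hypothetical additive energy penalties (kT) for two single-base changes.
DE1, DE2 = 1.0, 1.5

def factorization_gap(mu):
    """|r1 * r2 - r12|: zero if binding probabilities factor across positions."""
    p_ref = p_bound(0.0, mu)
    r1 = p_bound(DE1, mu) / p_ref
    r2 = p_bound(DE2, mu) / p_ref
    r12 = p_bound(DE1 + DE2, mu) / p_ref   # energies add for the double variant
    return abs(r1 * r2 - r12)

for mu in (-5.0, 3.0):   # low vs high TF concentration
    print(f"mu = {mu:+.1f}: gap = {factorization_gap(mu):.4f}")
```

At μ = −5 the gap is close to zero, so a log-odds model fits; at μ = 3 the gap is large even though the underlying energies are exactly additive.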

HT-SELEX (SELEX-Seq)

$$\min \sum_i \left[ N(S_i) - n(S_i) \right]^2$$

$$N(S_i) = \frac{a}{1 + e^{W \cdot S_i - \mu}} + b, \qquad n(S_i) = \text{data}$$

Parameters to fit: a, b, W, μ
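The least-squares objective can be sketched as below. This assumes the predicted count for a sequence has the form a / (1 + e^{E−μ}) + b with E = W·S, which is one reading of the slide's parameter list (a, b, W, μ); the two-position energy matrix and all numeric values are hypothetical, and the optimizer itself is left out.

```python
import math
from itertools import product

def predicted_count(site, W, mu, a, b):
    """Assumed model: scale a times Fermi-Dirac occupancy, plus background b."""
    energy = sum(W[i][base] for i, base in enumerate(site))   # E = W . S
    return a / (1.0 + math.exp(energy - mu)) + b

def sum_squared_error(data, W, mu, a, b):
    """Objective to minimize over a, b, W and mu."""
    return sum((predicted_count(s, W, mu, a, b) - n) ** 2 for s, n in data.items())

# Hypothetical two-position energy matrix (kT) and synthetic, noise-free "data".
W_true = [
    {"A": 0.0, "C": 1.0, "G": 2.0, "T": 1.5},
    {"A": 0.5, "C": 0.0, "G": 1.0, "T": 2.0},
]
sites = ["".join(p) for p in product("ACGT", repeat=2)]
data = {s: predicted_count(s, W_true, 1.0, 1000.0, 10.0) for s in sites}

print(sum_squared_error(data, W_true, 1.0, 1000.0, 10.0))   # exact parameters
print(sum_squared_error(data, W_true, 0.0, 1000.0, 10.0))   # wrong mu
```

In practice the objective would be handed to a nonlinear least-squares routine; because the occupancy term saturates, μ and W cannot be read off from simple log ratios of counts.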

Fit of model to HT-SELEX data for zif268

BEEML vs BioProspector

Zhao et al, PLoS Comp Bio, 2009

Protein Binding Microarray (PBM)

Example of Plag1 using BEEML-PBM

Nat Biotechnol. 2013 31:126-34.

• Most TFs (~90%) fit well by PWMs
• BEEML-PBM among the best methods
• Some do better with di-nuc models
• A few require multiple modes of interaction
• Best models fit in vivo data as well as in vivo-derived models

Zhao and Stormo, Nature Biotechnol. 2011 29:480-483

Bacterial-1-Hybrid (B1H)

B1H on zif268 returns the expected model

Average Prediction Accuracy for ZFPs

http://stormo.wustl.edu/ZFModels/

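HT-SELEX (SELEX-Seq)

$$\frac{P(S_i \mid b)}{P(S_i)} \propto \frac{1}{1 + e^{E_i - \mu}}$$

Compared to reference sequence with E = 0:

$$\frac{P(S_i \mid b) / P(S_i)}{P(S_{ref} \mid b) / P(S_{ref})} = \frac{1 + e^{-\mu}}{1 + e^{E_i - \mu}}$$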

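Spec-seq (specificity by sequencing)

$$\frac{P(S_i \mid b)}{P(S_i \mid u)} = e^{\mu - E_i}$$

Compared to reference sequence with E = 0:

$$\frac{P(S_{ref} \mid b) / P(S_{ref} \mid u)}{P(S_i \mid b) / P(S_i \mid u)} = e^{E_i} \;\rightarrow\; \ln \frac{P(S_{ref} \mid b) / P(S_{ref} \mid u)}{P(S_i \mid b) / P(S_i \mid u)} = E_i$$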

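$$P + S_i \leftrightarrow P \cdot S_i, \qquad K_A(S_i) = \frac{[P \cdot S_i]}{[P]\,[S_i]}$$

$$K_A(S_1) : K_A(S_2) : \dots : K_A(S_n) = \frac{[P \cdot S_1]}{[S_1]} : \frac{[P \cdot S_2]}{[S_2]} : \dots : \frac{[P \cdot S_n]}{[S_n]}$$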
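Because the free protein concentration is the same for every sequence in one binding reaction, bound/unbound ratios give relative affinities directly, and their logarithms give relative energies. A minimal sketch, assuming read counts in the bound and unbound fractions are proportional to the corresponding concentrations; the sequences and counts are hypothetical.

```python
import math

def relative_energies(bound, unbound, ref):
    """E_i = ln[(bound_ref/unbound_ref) / (bound_i/unbound_i)] in kT, with E_ref = 0."""
    ref_ratio = bound[ref] / unbound[ref]
    return {s: math.log(ref_ratio / (bound[s] / unbound[s])) for s in bound}

# Hypothetical read counts from the bound and unbound fractions.
bound = {"TATAAT": 8000, "TATGAT": 2000, "TACAAT": 500}
unbound = {"TATAAT": 1000, "TATGAT": 1000, "TACAAT": 1000}

energies = relative_energies(bound, unbound, ref="TATAAT")
relative_ka = {s: math.exp(-e) for s, e in energies.items()}   # K_A(S_i) / K_A(S_ref)
for s in bound:
    print(s, round(energies[s], 3), round(relative_ka[s], 3))
```

No model fitting is needed at this stage: every sequence in the library gets its own measured energy, and a PWM or higher-order model can be fitted to those energies afterwards.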

Spec-seq: Specificity by sequencing

Specificity of the Lac repressor

WT operator is asymmetric

4 libraries: vary both sequence and spacing

2560 different binding sites

Highly reproducible:
~5% variance in affinity
~0.1 kT variance in energy

Zuo and Stormo, Genetics, 2014

Three‐dimensional structure of the dimeric lac HP62–O1 operator complex.

Kalodimos C G et al. EMBO J. 2002;21:2866-2876

©2002 by European Molecular Biology Organization

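$$K_{X|Y}(x_1) : K_{X|Y}(x_2) : \dots : K_{X|Y}(x_n) = \frac{N(x_1 \mid B_{X,Y})}{N(x_1 \mid B_{-,Y})} : \frac{N(x_2 \mid B_{X,Y})}{N(x_2 \mid B_{-,Y})} : \dots : \frac{N(x_n \mid B_{X,Y})}{N(x_n \mid B_{-,Y})}$$

$$\omega_i = \frac{K_{X|Y}(S_i)}{K_X(S_i)} = \frac{K_{Y|X}(S_i)}{K_Y(S_i)} = \frac{K_{X,Y}(S_i)}{K_X(S_i)\, K_Y(S_i)}$$

$$K_X(x_1) : K_X(x_2) : \dots : K_X(x_n) = \frac{N(x_1 \mid B_{X,-})}{N(x_1 \mid B_{-,-})} : \frac{N(x_2 \mid B_{X,-})}{N(x_2 \mid B_{-,-})} : \dots : \frac{N(x_n \mid B_{X,-})}{N(x_n \mid B_{-,-})}$$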

Spec-seq for combinatorial binding can get all of the important parameters in one experiment, including cooperativity

Stormo, Zuo, Chang, Briefings in Functional Genomics, 2015

Specificity Modeling

Conclusions:
1. Different types of high-throughput data can be used to obtain good specificity models; good analysis methods are critical
2. PWMs are often (usually?) good approximations, but higher-order models can be obtained if needed

Discovery of Binding Motifs from in vivo data

Datatypes for Motif Discovery

• Co-regulated genes
– Genetic studies (deletion, over-expression effects)
– Expression analysis (microarrays, RNA-Seq)

• Co-bound regions
– ChIP-chip/-Seq location analysis

• Phylogenetic analysis, conservation across species
– “phylogenetic footprinting”
– Can be combined with multigene analysis, even over the whole genome

Goal: Find the “most significant” pattern in common
• Can’t look at all possible alignments – too many
• In vitro analysis methods don’t work; assumptions not valid

Outline of problem

CE1CG\TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG\
ECOARABOP\GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG\
ECOBGLR1\ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT\
ECOCRP\CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC\
ECOCYA\ACGGTGCTACACTTGTATGTAGCGCATCTTTCTTTACGGTCAATCAGCAAGGTGTTAAATTGATCACGTTTTAGACCATTTTTTCGTCGTGAAACTAAAAAAACC\
ECODEOP2\AGTGAATTATTTGAACCAGATCGCATTACAGTGATGCAAACTTGTAAGTAGATTTCCTTAATTGTGATGTGTATCGAAGTGTGTTGCGGAGTAGATGTTAGAATA\
ECOGALE\GCGCATAAAAAACGGCTAAATTCTTGTGTAAACGATTCCACTAATTTATTCCATGTCACACTTTTCGCATCTTTGTTATGCTATGGTTATTTCATACCATAAGCC\
ECOILVBPR\GCTCCGGCGGGGTTTTTTGTTATCTGCAATTCAGTACAAAACGTGATCAACCCCTCAATTTTCCCTTTGCTGAAAAATTTTCCATTGTCTCCCCTGTAAAGCTGT\
ECOLAC\AACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCAC\
ECOMALBA\ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGCGTAGGGGCAAGGAGGATGGAAAGAGGTTGCCGTATAAAGAAACTAGAGTCCGTTTA\
ECOMALBA\GGAGGAGGCGGGAGGATGAGAACACGGCTTCTGTGAACTAAACCGAGGTCATGTAAGGAATTTCGTGATGTTGCTTGCAAAAATCGTGGCGATTTTATGTGCGCA\
ECOMALT\GATCAGCGTCGTTTTAGGTGAGTTGTTAATAAAGATTTGGAATTGTGACACAGTGCAAATTCAGACACATAAAAAAACGTCATCGCTTGCATTAGAAAGGTTTCT\
ECOOMPA\GCTGACAAAAAAGATTAAACATACCTTATACAAGACTTTTTTTTCATATGCCTGACGGAGTTCACACTTGTAAGTTTTCAACTACGTTGTAGACTTTACATCGCC\
ECOTNAA\TTTTTTAAACATTAAAATTCTTACGTAATTTATAATCTTTAAAAAAAGCATTTAATATTGCTCCCCGAACGATTGTGATTCGATTCACATTTAAACAATTTCAGA\
ECOUXU1\CCCATGAGAGTGAAATTGTTGTGATGTGGTTAACCCAATTAGAATTCGGGATTGACATGTCTTACCAAAAGGTAGAACTTATACGCCATCTCATCCGATGCAAGC\
PBR322\CTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCTC\
TRN9CAT\CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCTGGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTTTTGGCGAAAATGAGACGTTGATCGGCACG\
TDC\GATTTTTATACTTTAACTTGTTGATATTTAAAGGTATTTAATTGTAATAACGATACTCTGGAAAGTATTGAAAGTTAATTTGTGAGTGGTCGCACATATCCTGTT\

Example dataset: promoter region from co-regulated genes

Expectation Maximization (EM) Approach to Motif Discovery

Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Iterate between those steps until convergence

Algorithm:
• Initial PWM from average of all possible sites
• Using current PWM, estimate probability of each position being a site; make new PWM from weighted average of all sites
• Iterate to convergence; usually fast, no guarantee of optimal
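The two alternating steps can be sketched in plain Python. This is a simplified illustration, not a reimplementation of any published program: it assumes exactly one site per sequence, a uniform background of 0.25, a fixed pseudocount, and a fixed number of iterations instead of a convergence test. The four short sequences are invented.

```python
import math

BASES = "ACGT"

def windows(seq, w):
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def pwm_from_weighted_sites(seqs, weights, w, pseudo=0.1):
    """Column-wise base frequencies over all windows, weighted by site probability."""
    counts = [{b: pseudo for b in BASES} for _ in range(w)]
    for seq, wts in zip(seqs, weights):
        for win, wt in zip(windows(seq, w), wts):
            for i, base in enumerate(win):
                counts[i][base] += wt
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def site_posteriors(seq, pwm, background=0.25):
    """P(site starts at each position), assuming exactly one site per sequence."""
    odds = [math.prod(pwm[i][base] / background for i, base in enumerate(win))
            for win in windows(seq, len(pwm))]
    total = sum(odds)
    return [o / total for o in odds]

def em_motif(seqs, w, iterations=50):
    # Initial PWM: average of all possible sites (every window weighted equally).
    weights = [[1.0 / (len(s) - w + 1)] * (len(s) - w + 1) for s in seqs]
    pwm = pwm_from_weighted_sites(seqs, weights, w)
    for _ in range(iterations):
        weights = [site_posteriors(s, pwm) for s in seqs]   # given PWM, weight sites
        pwm = pwm_from_weighted_sites(seqs, weights, w)     # given sites, update PWM
    return pwm, weights

# Hypothetical input sequences, for illustration only.
seqs = ["ACGTTGTGACCA", "TTGTGAGGCATC", "GGCATTGTGAAC", "CATGTGATTGCA"]
pwm, weights = em_motif(seqs, w=5)
best_starts = [max(range(len(row)), key=row.__getitem__) for row in weights]
print(best_starts)
```

Because every window contributes in proportion to its probability, the procedure is deterministic: the same input always converges to the same (possibly local) optimum.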

Gibbs’ Sampling Approach to Motif Discovery

Same Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Iterate between those steps until convergence

Algorithm:
• Pick 1 site from N-1 sequences, make PWM
• Use “pseudocounts” to avoid prob. = 0
• Use current PWM to pick site from left-out sequence by sampling from probability distribution; update PWM
• Iterate to convergence; run multiple times, compare results; still no guarantee of optimal but avoids local optima often obtained with EM
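A minimal sketch of the sampling loop, under stated assumptions: one site per sequence, uniform background of 0.25, a fixed pseudocount, a fixed number of sweeps instead of a convergence test, and a seeded random generator so that runs are repeatable. The sequences and function names are hypothetical.

```python
import math
import random

BASES = "ACGT"

def build_pwm(sites, pseudo=0.5):
    """Base frequencies per column; pseudocounts keep every probability above zero."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {b: pseudo for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def gibbs_motif(seqs, w, sweeps=200, seed=0):
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]   # random initial sites
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # PWM from the current sites of all sequences except the left-out one.
            others = [s[p:p + w] for j, (s, p) in enumerate(zip(seqs, starts)) if j != k]
            pwm = build_pwm(others)
            odds = [math.prod(pwm[i][seq[p + i]] / 0.25 for i in range(w))
                    for p in range(len(seq) - w + 1)]
            # Sample (do not maximize) the new site for the left-out sequence.
            starts[k] = rng.choices(range(len(odds)), weights=odds)[0]
    return starts

# Hypothetical input sequences, for illustration only.
seqs = ["ACGTTGTGACCA", "TTGTGAGGCATC", "GGCATTGTGAAC", "CATGTGATTGCA"]
starts = gibbs_motif(seqs, w=5, sweeps=50, seed=1)
print([s[p:p + 5] for s, p in zip(seqs, starts)])
```

Sampling instead of taking the best-scoring position is what lets the procedure leave a poor alignment; running it with several seeds and comparing the results follows the advice on the slide.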

From Lawrence et al. (1993) Science 262:208-14.

Motif discovery from co-regulated genes

Single species Multiple species

Example – Leu3

Alignment of profiles

A . . 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . .
C . . 0 1 1 2 4 2 4 4 4 0 0 0 0 0 4 0 0 0 4 4 0 0 4 0 0 0 0 . .
G . . 1 0 0 0 0 0 0 0 0 4 4 0 0 4 0 4 4 4 0 0 3 0 0 4 0 0 0 . .
T . . 3 3 3 2 0 2 0 0 0 0 0 4 0 0 0 0 0 0 0 0 1 4 0 0 4 4 4 . .

A . . 0 0 4 2 0 0 0 0 0 0 0 0 4 4 0 0 0 1 0 0 0 3 1 0 0 3 4 . .
C . . 0 2 0 1 0 0 0 4 4 0 0 0 0 0 4 0 0 1 4 1 0 1 0 0 1 0 0 . .
G . . 0 0 0 0 4 4 0 0 0 4 4 0 0 0 0 4 4 0 0 2 0 0 3 1 3 1 0 . .
T . . 4 2 0 1 0 0 4 0 0 0 0 4 0 0 0 0 0 2 0 1 4 0 0 3 0 0 0 . .

A . . 0 2 1 1 0 1 0 0 0 0 0 0 4 0 0 0 0 0 0 0 1 2 3 0 0 1 1 . .
C . . 3 0 1 1 4 0 0 4 4 0 0 0 0 4 4 0 0 4 0 0 0 0 1 1 0 3 2 . .
G . . 1 1 2 0 0 0 4 0 0 4 4 0 0 0 0 4 4 0 0 0 3 2 0 0 1 0 1 . .
T . . 0 1 0 2 0 3 0 0 0 0 0 4 0 0 0 0 0 0 4 4 0 0 0 3 3 0 0 . .

YGL125W

YOR108W

YMR108W

S. cerevisiae GAAAAAATAACAGCGACTTTTCTCCCGGTAGCGGGCCGTCGTTTAGTCATTCTATCCCTC
S. mikatae AAAACATAACAGCGAATTTTCCTCCCGGTAGCGGGCCTTCGTTTAGTCATTCTCTCTCTT
S. bayanus AAAAAATAACAGCGACTTTTCCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCCC
S. kudriavzevii GAAAAAAAACAACGGCGGCCTCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCTC

***** **** ** *** * *************************************

YGL125W

S. cerevisiae GCCATCATGGTCCGGTAACGGTCGTAGTGAATGACTCATATTTTTCCATCTCTTT
S. mikatae GCCATCAAGGTCCGGTAACGGTCGTAGTGAATGACTCACATTTTCTTCGTTATTC
S. bayanus ACCATTACGGTCCGGTAACGGACTTAGTGAATGATTCATCTTTTCTTCTTTTTTC
S. kudriavzevii GTCGTTAAGGTCCGGTAACGGCCCTCAGCGAATGATTCATAATTTCATTTTTTTC

***** * ************* * ********** *** **** *** ***

YOR108W

S. cerevisiae AACGCCTAGCCGCCGGAGCCTGCCGGTACCGGCTTGGCTTCAGTTGCTGATCTCGG
S. mikatae CACAATGACACATACCTAACAGCCGGTACCGGCTTGAATGCCGCCGTTGGCTTCGG
S. bayanus ATCTTCTAGTCACCGCAGTCTGCCGGTACCGGCTTGAATTCCGCCGTTGATCCTGG
S. kudriavzevii CACATCTCTAGTCCGCGCTCTGCCGGTACCGGCTTAGACTAGCCACGAATCTCGGC

** *** * **** ***************** **** ** * ** **

YMR108W

Alignment of conserved regions

Wang and Stormo, Bioinformatics 2003

Even whole-genome search for conserved, multi-copy elements (e.g., PhyloNet)

Wang and Stormo, PNAS 2005
