
10/29/2016


Modeling Motifs – Collecting Data
(Measuring and Modeling Specificity of Protein-DNA Interactions)

Computational Genomics Course
Cold Spring Harbor Labs
Oct 31, 2016

Gary D. Stormo
Department of Genetics

Outline

• Modeling specificity with a position weight matrix (PWM)

– General features

– Limitations and extensions

• How to set the weights

– General ideas, some history

– Using high-throughput experimental data

– Using in vivo location data (ChIP-chip/seq)


Terminology: Sites vs Motifs

Think restriction sites:

         {Sites}                              Motif
EcoRI:   {GAATTC}                             GAATTC
HincII:  {GTTAAC, GTTGAC, GTCAAC, GTCGAC}     GTYRAC

Transcription factor motifs should be quantitative: they give different scores to different sites, reflecting differences in binding affinity.

Also: a site is a specific location in the genome.

Representations/Models of Protein-DNA binding

• Transcription factors don’t bind to just one sequence

• A “consensus sequence” is usually the preferred site, but similar sequences also bind well

• Not all variants bind equally well; some positions contribute more to the specificity than others


Position Weight Matrix Model (PWM, also PSSM)

A:  -8  10  -1   2   1  -8
C: -10  -9  -3  -2  -1 -12
G:  -7  -9  -1  -1  -4  -9
T:  10  -6   9   0  -1  11

PWM Model: scoring a sequence

The matrix slides along the sequence one base at a time. Each 6-base window is scored by summing the matrix element for the base found at each position.

… A C T A T A A T G T …

Window CTATAA: Score = -24

Window TATAAT: Score = 43

A PWM is a generalization of a consensus sequence. There is NO advantage in consensus sequences: given a consensus sequence, one can define a PWM and a threshold that will return the same sites.
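The consensus-to-PWM claim can be checked on the restriction-site examples from the terminology slide. The sketch below is one possible construction, not the only one: the 0/-1 weights, the threshold of 0, and the function names are choices made here for illustration.

```python
from itertools import product

# IUPAC codes needed for the two examples (a subset, not the full table).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "R": "AG", "N": "ACGT"}

def consensus_to_pwm(consensus):
    """One column per position: 0 for an allowed base, -1 for a disallowed one."""
    return [{b: (0 if b in IUPAC[code] else -1) for b in "ACGT"} for code in consensus]

def score(pwm, site):
    """Sum of the matrix elements selected by the bases of the site."""
    return sum(col[base] for col, base in zip(pwm, site))

def sites_at_threshold(pwm, threshold=0):
    """All sequences of the motif length that score at or above the threshold."""
    return sorted("".join(s) for s in product("ACGT", repeat=len(pwm))
                  if score(pwm, s) >= threshold)

hincii = sites_at_threshold(consensus_to_pwm("GTYRAC"))
ecori = sites_at_threshold(consensus_to_pwm("GAATTC"))
print(hincii)
print(ecori)
```

With these weights any mismatch drops the score below 0, so the threshold returns exactly the sites the consensus describes. Replacing the 0/-1 entries with graded values is what turns the same matrix into a quantitative model.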

PWM Model

$\mathrm{Score}(S_i) = W \cdot S_i$

PWM is a linear model:
• S_i encodes the sequence (which base occurs at each position)
• W weights those encoded features to provide the score
• Easy to add more features if they are necessary
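The linear-model view can be made concrete with the example matrix from the slides. This is a minimal sketch: the one-hot encoding and the function names are illustrative choices, and only the matrix values and the test sequence come from the slides.

```python
BASES = "ACGT"

# Weight matrix from the slides: one row per base, one column per position.
W = {
    "A": [-8, 10, -1, 2, 1, -8],
    "C": [-10, -9, -3, -2, -1, -12],
    "G": [-7, -9, -1, -1, -4, -9],
    "T": [10, -6, 9, 0, -1, 11],
}

def one_hot(site):
    """Encode a site as S: S[b][i] is 1 if base b occurs at position i, else 0."""
    return {b: [1 if base == b else 0 for base in site] for b in BASES}

def score(site):
    """Score = W . S, the sum of element-wise products of weights and encoding."""
    s = one_hot(site)
    return sum(W[b][i] * s[b][i] for b in BASES for i in range(len(site)))

def scan(sequence, width=6):
    """Score every window of the given width along the sequence."""
    return [score(sequence[i:i + width]) for i in range(len(sequence) - width + 1)]

scores = scan("ACTATAATGT")
print(scores)
```

The second and third windows (CTATAA and TATAAT) give the -24 and 43 shown on the slides. Adding features, such as dinucleotide terms, only means adding rows to both W and the encoding; the scoring function stays the same.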


Two important issues need to be addressed

• Parameter estimation: Where do the matrix elements come from? Different types of data lead to different methods of parameter estimation.

• Additivity: do the positions really contribute independently to the binding interaction? If not, how do we extend the model?

Complete binding energy list vs model.


-0.79    0       1.25    TT
  ↓      ↓       ↓       ↓
-0.79    0      -0.13    AT
 1.38    0.42   -1.21    AG
 1.38   -0.42    0       AC
 1.38    0.42    0.83    AA
  3      2       1

  ↓

-0.79    1.25    TTT
  ↓      ↓       ↓
-0.79    0.83    AAT
 1.38    1.25    AAG
 1.38    0.41    AAC
 1.38    1.25    AAA
  2      1

If a simple additive model is inadequate, one can use dinucleotide or higher-order models. Some form of matrix model must be correct, because the binding data itself is a matrix (vector).

Alternative approach to higher-order contributions: structure parameters

Maybe the non-additivity is due to structural preferences known to be dependent on nearest-neighbor bases (or longer)

May capture context effects with fewer parameters

For example, see work by Rohs and colleagues:
• Covariation between homeodomain transcription factors and the shape of their DNA binding sites. Nucleic Acids Res. 2014 42:430-41
• TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res. 2014 42(Database issue):D148-55.
• Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci U S A. 2015 112(15):4654-9.
• DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Syst. 2016 Sep 28;3(3):278-286

http://www.ncbi.nlm.nih.gov/pubmed/24078250

http://www.ncbi.nlm.nih.gov/pubmed/24214955

https://www.ncbi.nlm.nih.gov/pubmed/27546793


How to Set the Matrix Elements

• Statistical treatment of known sites. Need a reasonable sample size. Some assumptions about how the sample is obtained.

- probabilistic model is easy, can be accurate if assumptions are reasonable

• Quantitative binding data: determine matrix parameters that provide the best fit.

- Has been laborious and slow experimental work, but new technologies make this much easier

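Modeling based on known sites

N(b,i): number of times base b occurs at position i in the aligned sites

F(b,i): PFM or PPM (note: some papers and programs call this a PWM)

W(b,i) = log[F(b,i)/P(b)]: log-odds PWM

I(i) = ∑ F(b,i) W(b,i), summed over the bases b at position i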
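The chain from counts to frequencies to log-odds weights to information content can be sketched as follows. Several details are assumptions made here rather than taken from the slides: the pseudocount, the uniform background P(b), the use of log base 2 (so information is in bits), and the function names. The four HincII sites from the terminology slide serve as toy input.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """N(b,i): number of times base b occurs at position i."""
    length = len(sites[0])
    return {b: [sum(1 for s in sites if s[i] == b) for i in range(length)] for b in BASES}

def frequency_matrix(counts, pseudocount=1.0):
    """F(b,i): counts as frequencies; the pseudocount (an assumption here)
    keeps every frequency above zero so the logarithm is defined."""
    length = len(counts["A"])
    totals = [sum(counts[b][i] for b in BASES) + 4 * pseudocount for i in range(length)]
    return {b: [(counts[b][i] + pseudocount) / totals[i] for i in range(length)] for b in BASES}

def log_odds_matrix(freqs, background):
    """W(b,i) = log2[F(b,i) / P(b)]."""
    return {b: [math.log2(f / background[b]) for f in freqs[b]] for b in BASES}

def information_content(freqs, weights):
    """I(i) = sum over b of F(b,i) * W(b,i); in bits because W uses log base 2."""
    length = len(freqs["A"])
    return [sum(freqs[b][i] * weights[b][i] for b in BASES) for i in range(length)]

sites = ["GTTAAC", "GTTGAC", "GTCAAC", "GTCGAC"]
background = {b: 0.25 for b in BASES}
N = count_matrix(sites)
F = frequency_matrix(N)
W = log_odds_matrix(F, background)
info = information_content(F, W)
print([round(x, 3) for x in info])
```

The fully conserved positions carry more information than the two degenerate positions (Y and R), which is what the column heights in a sequence logo display.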


Classic Logo (from Tom Schneider): the height of the column at each position is the Information Content, with each base drawn in proportion to its frequency.

Likelihood Ratio Statistics Primer

Given two probability distributions $P_i$ and $Q_i$:

$\sum_i P_i = \sum_i Q_i = 1$

and some data, $D_i$, which is the number of times each type i is observed in N total observations.

The Likelihood Ratio of the data being from distribution $Q_i$ versus $P_i$ is:

$LR = \prod_i (Q_i/P_i)^{D_i}$

and the log-Likelihood Ratio is:

$LLR = \sum_i D_i \ln (Q_i/P_i)$

The maximum likelihood distribution is $Q_i = D_i/N$, so

$\max LLR = N \sum_i Q_i \ln (Q_i/P_i)$

$\sum_i Q_i \ln (Q_i/P_i) \ge 0$

≡ Information Content, Relative Entropy, Kullback-Leibler Distance

Related to the G-statistic and χ²
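A small numerical check of the primer, using made-up counts for one motif position and a uniform background. The counts, the alternative distribution, and the function names are hypothetical; the G-statistic line uses the standard definition G = 2 ∑ D_i ln(D_i / expected_i) with expected_i = N P_i.

```python
import math

def log_likelihood_ratio(counts, q, p):
    """LLR = sum_i D_i * ln(Q_i / P_i); terms with D_i = 0 contribute nothing."""
    return sum(d * math.log(qi / pi) for d, qi, pi in zip(counts, q, p) if d > 0)

def relative_entropy(q, p):
    """sum_i Q_i * ln(Q_i / P_i), taking 0 * ln(0) as 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical data: base counts at one motif position, uniform background P.
counts = [1, 2, 2, 15]                # D_i for A, C, G, T
n = sum(counts)
p = [0.25, 0.25, 0.25, 0.25]
q_ml = [d / n for d in counts]        # maximum likelihood estimate Q_i = D_i / N

max_llr = log_likelihood_ratio(counts, q_ml, p)
other_llr = log_likelihood_ratio(counts, [0.1, 0.1, 0.1, 0.7], p)
g_statistic = 2 * max_llr             # G = 2 * sum_i D_i * ln(D_i / (N * P_i))

print(max_llr, n * relative_entropy(q_ml, p), other_llr)
```

The first two printed values agree, and any other choice of Q gives a smaller LLR for the same data.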

Modeling from experimental data

• From single binding site experiments to high-throughput methods that allow for the determination of specificity (relative affinity) across all possible sequences at once


Quantitative Binding Affinity of TF for one sequence

Specificity

Refers to the relative affinity to different sequences, ideally to all sequences


Specificity Modeling

High-throughput experimental methods to measure TF specificity

High-throughput in vitro binding site analyses

• Can give good, quantitative models of intrinsic binding specificity

• More data alone isn’t sufficient to give better models, also need good analysis methods

• Log-odds method is based on assumptions that may not be true

• Energetic models can give better descriptions
– Non-linear relationship between binding affinity and binding probability at high TF concentration


Log-odds method is equivalent to an energy model if the sites are from a Boltzmann distribution with binding probability $\propto e^{-E}$

$F(S_i) = P(S_i)\, e^{-E_i} / Z$   (F: posterior, P: prior)

$E_i = -\ln \dfrac{F(S_i)}{P(S_i)} - \ln Z$   (energy; $\ln Z$ is a constant shared by all sequences)

Log-odds relationship between binding energy and frequencies

Reality is a Fermi-Dirac distribution with Boltzmann a special case at the low concentration range

Djordjevic et al, Genome Res. 2003 13:2381-90.
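In the Boltzmann (low concentration) case the log-odds of bound over prior frequencies returns the energies exactly, up to a shared constant. The sketch below checks this on four made-up sequences; the energies, the uniform prior, and the choice of reference sequence are all hypothetical.

```python
import math

# Hypothetical binding energies (in kT) with the best binder as the zero-energy
# reference, and a uniform prior P(S_i) over the four sequences.
energies = {"TATAAT": 0.0, "TATACT": 1.0, "TACAAT": 2.5, "GATAAT": 4.0}
prior = {s: 1.0 / len(energies) for s in energies}

# Boltzmann distribution over bound sequences: F(S_i) = P(S_i) * exp(-E_i) / Z.
z = sum(prior[s] * math.exp(-e) for s, e in energies.items())
bound_freq = {s: prior[s] * math.exp(-energies[s]) / z for s in energies}

# Negative log-odds of posterior over prior equals E_i plus the constant ln Z;
# subtracting the reference value removes that constant.
log_odds = {s: -math.log(bound_freq[s] / prior[s]) for s in energies}
recovered = {s: log_odds[s] - log_odds["TATAAT"] for s in energies}
print(recovered)
```

Outside the low-concentration range the bound frequencies follow the Fermi-Dirac form instead, and this simple inversion no longer returns the true energies.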


Additive changes in binding energy have non-independent (context-dependent) effects on binding probability

Probabilities no longer factor, even though energies are additive

E_G - E_A = 2kT

GTGGA vs ATGGA

GTGTA vs ATGTA
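The context dependence can be reproduced with a two-position additive energy model. Only E_G - E_A = 2kT at position 1 and the four sequences come from the slide; the 3 kT penalty at position 4 and the two values of μ are made-up numbers for illustration.

```python
import math

def binding_probability(energy, mu):
    """Fermi-Dirac form: P(bound) = 1 / (1 + exp(E - mu)), energies in kT."""
    return 1.0 / (1.0 + math.exp(energy - mu))

# Additive energy model. E_G - E_A = 2 kT at position 1 is from the slide;
# the 3 kT penalty for T at position 4 is a hypothetical value.
POSITION_1 = {"A": 0.0, "G": 2.0}
POSITION_4 = {"G": 0.0, "T": 3.0}

def energy(site):
    """Total energy is the sum of independent position contributions."""
    return POSITION_1[site[0]] + POSITION_4[site[3]]

def a_over_g_ratio(context, mu):
    """P(bound) with A at position 1 divided by P(bound) with G, same context."""
    return (binding_probability(energy("A" + context), mu)
            / binding_probability(energy("G" + context), mu))

high_mu, low_mu = 3.0, -10.0          # hypothetical high and low TF concentrations
strong_context = a_over_g_ratio("TGGA", high_mu)   # ATGGA vs GTGGA
weak_context = a_over_g_ratio("TGTA", high_mu)     # ATGTA vs GTGTA
dilute = [a_over_g_ratio(c, low_mu) for c in ("TGGA", "TGTA")]
print(strong_context, weak_context, dilute)
```

At high concentration the same 2 kT change has a small effect in the strong context, where both sites are close to saturated, and a larger effect in the weak context. At low concentration both ratios approach e², the Boltzmann value, and the probabilities factor again.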

HT-SELEX (SELEX-Seq)

$\min \sum_i [N(S_i) - n(S_i)]^2$

$N(S_i) = \dfrac{a}{1 + e^{W \cdot S_i - \mu}} + b$,   $n(S_i)$ = data

Parameters to fit: a, b, W, μ
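A sketch of the least-squares objective only, assuming the predicted count has the form N(S_i) = a / (1 + exp(W·S_i - μ)) + b and n(S_i) is the observed count. The 2-bp energy matrix, the parameter values, and the function names are hypothetical, and the numerical optimizer that a real fit needs is left out.

```python
import math

BASES = "ACGT"

def energy(weights, site):
    """E_i = W . S_i: the sum of the weight for the base seen at each position."""
    return sum(weights[pos][base] for pos, base in enumerate(site))

def predicted_count(site, a, b, weights, mu):
    """N(S_i) = a / (1 + exp(W . S_i - mu)) + b."""
    return a / (1.0 + math.exp(energy(weights, site) - mu)) + b

def objective(params, observed):
    """Sum over sequences of [N(S_i) - n(S_i)]^2 for one parameter set."""
    a, b, weights, mu = params
    return sum((predicted_count(s, a, b, weights, mu) - n) ** 2
               for s, n in observed.items())

# Hypothetical 2-bp energy matrix (kT); the preferred base at each position is 0.
true_w = [{"A": 0.0, "C": 2.0, "G": 1.0, "T": 3.0},
          {"A": 1.5, "C": 0.0, "G": 2.5, "T": 0.5}]
true_params = (1000.0, 5.0, true_w, 1.0)

# Synthetic, noise-free "data" generated from the model itself.
sites = [x + y for x in BASES for y in BASES]
observed = {s: predicted_count(s, *true_params) for s in sites}

wrong_params = (1000.0, 5.0, true_w, -1.0)
print(objective(true_params, observed), objective(wrong_params, observed))
```

The objective is zero at the generating parameters and positive when μ is shifted. A fit would minimize it over a, b, W and μ together, typically with a general-purpose non-linear least-squares routine.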


Fit of model to HT-SELEX data for zif268: BEEML vs BioProspector

Zhao et al, PLoS Comp Bio, 2009

Protein Binding Microarray (PBM)


Example of Plag1 using BEEML-PBM

Nat Biotechnol. 2013 31:126-34.

• Most TFs (~90%) fit well by PWMs
• BEEML-PBM among the best methods
• Some do better with di-nuc models
• A few require multiple modes of interaction
• Best models fit in vivo data as well as in vivo-derived models

Zhao and Stormo, Nature Biotechnol. 2011 29:480-483


Weirauch et al

Diverse sets:

>100 TFs

~20 TFs

~240 TFs

>1000 TFs

Bacterial-1-Hybrid (B1H)


B1H on zif268 returns the expected model


Average Prediction Accuracy for ZFPs

http://stormo.wustl.edu/ZFModels/

HT-SELEX (SELEX-Seq)

$\dfrac{P(S_i|b)}{P(S_i)} \propto \dfrac{1}{1 + e^{E_i - \mu}}$

Compared to reference sequence with E = 0:

$\dfrac{P(S_i|b)\,/\,P(S_i)}{P(S_{ref}|b)\,/\,P(S_{ref})} = \dfrac{1 + e^{-\mu}}{1 + e^{E_i - \mu}}$


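Spec-seq (specificity by sequencing)

$\dfrac{P(S_i|b)}{P(S_i|u)} = e^{\mu - E_i}$

Compared to reference sequence with E = 0:

$\dfrac{P(S_{ref}|b)\,/\,P(S_{ref}|u)}{P(S_i|b)\,/\,P(S_i|u)} = e^{E_i} \;\rightarrow\; \ln \dfrac{P(S_{ref}|b)\,/\,P(S_{ref}|u)}{P(S_i|b)\,/\,P(S_i|u)} = E_i$

$\mathrm{P} + S_i \leftrightarrow \mathrm{P} \cdot S_i$

$K_A(S_i) = \dfrac{[\mathrm{P} \cdot S_i]}{[\mathrm{P}][S_i]}$

$K_A(S_1) : K_A(S_2) : \dots : K_A(S_n) = \dfrac{[\mathrm{P} \cdot S_1]}{[S_1]} : \dfrac{[\mathrm{P} \cdot S_2]}{[S_2]} : \dots : \dfrac{[\mathrm{P} \cdot S_n]}{[S_n]}$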
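Because the free protein concentration is the same for every sequence in the library, relative energies follow directly from the ratio of bound to unbound read counts. A minimal sketch, with made-up sequences and read counts and a hypothetical function name:

```python
import math

def relative_energies(bound, unbound, reference):
    """E_i (kT) relative to the reference: -ln of the bound/unbound count ratio
    for S_i divided by the same ratio for the reference sequence."""
    ref_ratio = bound[reference] / unbound[reference]
    return {s: -math.log((bound[s] / unbound[s]) / ref_ratio) for s in bound}

# Hypothetical read counts from the bound and unbound fractions of one library.
bound = {"GTGGA": 9000, "ATGGA": 1200, "GTGTA": 400}
unbound = {"GTGGA": 3000, "ATGGA": 3000, "GTGTA": 2950}

energies = relative_energies(bound, unbound, reference="GTGGA")
print({s: round(e, 2) for s, e in energies.items()})
```

The reference sequence gets E = 0 by construction, and sequences that are depleted from the bound fraction get positive energies. Unlike the HT-SELEX relation, no concentration parameter μ has to be fitted, because it cancels in the ratio.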


Specificity of the Lac repressor

WT operator is asymmetric

4 libraries: vary both sequence and spacing

2560 different binding sites

Highly reproducible:
~5% variance in affinity
~0.1kT variance in energy

Zuo and Stormo, Genetics, 2014

Three‐dimensional structure of the dimeric lac HP62–O1 operator complex.

Kalodimos C G et al. EMBO J. 2002;21:2866-2876

©2002 by European Molecular Biology Organization


No motif for half of all human TFs – Most are C2H2 zinc finger proteins

(Laura Campitelli, Matt Weirauch)

Human – all TFs (1,665)
• Known motif (637)
• No motif (809)
• Close ortholog/paralog has motif (219)

Human – no motif (809)
• C2H2 with no motif (573)
• Possibly not sequence-specific (143)
• Needs heterodimerization partner (56)
• Not tried/no data (37)

Human – all C2H2s (714)
• Known motif (97)
• No motif (573)
• Close ortholog/paralog has motif (44)


ZF specificity prediction

Use three programs: ours, one from the Princeton group, one from the Toronto group

The logos look pretty different, but that is largely quantitative, and there are many high-IC positions of agreement. By averaging the PFMs one can obtain a consensus sequence that agrees pretty well with all three.

Reverse Consensus (30bp)

TCTTGATGATGCTGCAATATTAATAATTTA

Spec-seq randomizations:

Consensus is “good enough” to show shift in EMSA.

So we randomized five adjacent positions at a time, generating 6 libraries of 1024 sequences. Merged logo shows overall good match with consensus and provides quantitative predictions about binding energy contributions.


Spec-seq motif matches well with the motif obtained from in vivo recombination hotspots and with the one obtained using the Affinity-seq method

Affinity-seq pulls out genomic DNA fragments in vitro and sequences them without amplification.

Affinity-seq motif

Hotspot motif

Can also be easily adapted to study CpG methylation sensitivity
• ZFP57 involved in imprinting maintenance
• Has 2 ZF clusters, one binds TGCCGC, prefers mCpG
• 3 libraries with random regions and methylation variants


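$K_{X|Y}(x_1) : K_{X|Y}(x_2) : \dots : K_{X|Y}(x_n) = \dfrac{N(x_1|B_{X,Y})}{N(x_1|B_{-,Y})} : \dfrac{N(x_2|B_{X,Y})}{N(x_2|B_{-,Y})} : \dots : \dfrac{N(x_n|B_{X,Y})}{N(x_n|B_{-,Y})}$

$\omega_i = \dfrac{K_{X|Y}(S_i)}{K_X(S_i)} = \dfrac{K_{Y|X}(S_i)}{K_Y(S_i)} = \dfrac{K_{X,Y}(S_i)}{K_X(S_i)\,K_Y(S_i)}$

$K_X(x_1) : K_X(x_2) : \dots : K_X(x_n) = \dfrac{N(x_1|B_{X,-})}{N(x_1|B_{-,-})} : \dfrac{N(x_2|B_{X,-})}{N(x_2|B_{-,-})} : \dots : \dfrac{N(x_n|B_{X,-})}{N(x_n|B_{-,-})}$

Spec-seq for combinatorial binding can get all of the important parameters in one experiment, including cooperativity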

Stormo, Zuo, Chang, Briefings in Functional Genomics, 2015

Specificity Modeling

Conclusions:
1. Different types of high-throughput data can be used to obtain good specificity models; good analysis methods are critical
2. PWMs are often (usually?) good approximations, but higher-order models can be obtained if needed


Discovery of Binding Motifs from in vivo data

Datatypes for Motif Discovery
• Co-regulated genes
– Genetic studies (deletion, over-expression effects)
– Expression analysis (microarrays, RNA-Seq)
• Co-bound regions
– ChIP-chip/-Seq location analysis
• Phylogenetic analysis, conservation across species
– “phylogenetic footprinting”
– Can be combined with multigene analysis, even over the whole genome

Goal: Find the “most significant” pattern in common
• Can’t look at all possible alignments – too many
• In vitro analysis methods don’t work; assumptions not valid

Outline of problem


CE1CG

\TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCACAAAAATGGAAGTCCACAGTCTTGACAG\

ECOARABOP

\GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAGAAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATGCCATAGCATTTTTATCCATAAG\

ECOBGLR1

\ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATATAACTTTATAAATTCCTAAAATTACACAAAGTTAATAACTGTGAGCATGGTCATATTTTTATCAAT\

ECOCRP

\CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTACAGTAATACATTGATGTACTGCATGTATGCAAAGGACGTCACATTACCGTGCAGTACAGTTGATAGC\

ECOCYA

\ACGGTGCTACACTTGTATGTAGCGCATCTTTCTTTACGGTCAATCAGCAAGGTGTTAAATTGATCACGTTTTAGACCATTTTTTCGTCGTGAAACTAAAAAAACC\

ECODEOP2

\AGTGAATTATTTGAACCAGATCGCATTACAGTGATGCAAACTTGTAAGTAGATTTCCTTAATTGTGATGTGTATCGAAGTGTGTTGCGGAGTAGATGTTAGAATA\

ECOGALE

\GCGCATAAAAAACGGCTAAATTCTTGTGTAAACGATTCCACTAATTTATTCCATGTCACACTTTTCGCATCTTTGTTATGCTATGGTTATTTCATACCATAAGCC\

ECOILVBPR

\GCTCCGGCGGGGTTTTTTGTTATCTGCAATTCAGTACAAAACGTGATCAACCCCTCAATTTTCCCTTTGCTGAAAAATTTTCCATTGTCTCCCCTGTAAAGCTGT\

ECOLAC

\AACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCAC\

ECOMALBA

\ACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGCGTAGGGGCAAGGAGGATGGAAAGAGGTTGCCGTATAAAGAAACTAGAGTCCGTTTA\

ECOMALBA

\GGAGGAGGCGGGAGGATGAGAACACGGCTTCTGTGAACTAAACCGAGGTCATGTAAGGAATTTCGTGATGTTGCTTGCAAAAATCGTGGCGATTTTATGTGCGCA\

ECOMALT

\GATCAGCGTCGTTTTAGGTGAGTTGTTAATAAAGATTTGGAATTGTGACACAGTGCAAATTCAGACACATAAAAAAACGTCATCGCTTGCATTAGAAAGGTTTCT\

ECOOMPA

\GCTGACAAAAAAGATTAAACATACCTTATACAAGACTTTTTTTTCATATGCCTGACGGAGTTCACACTTGTAAGTTTTCAACTACGTTGTAGACTTTACATCGCC\

ECOTNAA

\TTTTTTAAACATTAAAATTCTTACGTAATTTATAATCTTTAAAAAAAGCATTTAATATTGCTCCCCGAACGATTGTGATTCGATTCACATTTAAACAATTTCAGA\

ECOUXU1

\CCCATGAGAGTGAAATTGTTGTGATGTGGTTAACCCAATTAGAATTCGGGATTGACATGTCTTACCAAAAGGTAGAACTTATACGCCATCTCATCCGATGCAAGC\

PBR322

\CTGGCTTAACTATGCGGCATCAGAGCAGATTGTACTGAGAGTGCACCATATGCGGTGTGAAATACCGCACAGATGCGTAAGGAGAAAATACCGCATCAGGCGCTC\

TRN9CAT

\CTGTGACGGAAGATCACTTCGCAGAATAAATAAATCCTGGTGTCCCTGTTGATACCGGGAAGCCCTGGGCCAACTTTTGGCGAAAATGAGACGTTGATCGGCACG\

TDC

\GATTTTTATACTTTAACTTGTTGATATTTAAAGGTATTTAATTGTAATAACGATACTCTGGAAAGTATTGAAAGTTAATTTGTGAGTGGTCGCACATATCCTGTT\

Example dataset: promoter region from co-regulated genes


Expectation Maximization (EM) Approach to Motif Discovery

Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Make initial guess, then iterate between those steps until convergence

Algorithm:
• Initial PWM from average of all possible sites
• Using current PWM, estimate probability of each position being a site; make new PWM from weighted average of all sites
• Iterate to convergence; usually fast, no guarantee of optimal
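The EM steps listed above can be sketched in a few lines. This is a simplified illustration, not a reimplementation of any published program: it assumes exactly one site per sequence, a uniform background, a fixed number of iterations instead of a convergence test, and a small pseudocount. The toy sequences are made up.

```python
import math

BASES = "ACGT"

def weighted_pfm(seqs, width, weights, pseudocount=0.5):
    """Frequency matrix from all windows, each weighted by its site probability."""
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for seq, seq_weights in zip(seqs, weights):
        for start, w in enumerate(seq_weights):
            for i in range(width):
                counts[i][seq[start + i]] += w
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]

def site_probabilities(seq, pfm, background):
    """Probability of each window being the site, assuming one site per sequence."""
    width = len(pfm)
    ratios = []
    for start in range(len(seq) - width + 1):
        log_ratio = sum(math.log(pfm[i][seq[start + i]] / background[seq[start + i]])
                        for i in range(width))
        ratios.append(math.exp(log_ratio))
    total = sum(ratios)
    return [r / total for r in ratios]

def em_motif(seqs, width, iterations=50):
    background = {b: 0.25 for b in BASES}          # assumed uniform background
    # Initial guess: every window is equally likely to be the site, so the
    # first matrix is the average over all possible sites.
    weights = [[1.0 / (len(s) - width + 1)] * (len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        pfm = weighted_pfm(seqs, width, weights)
        weights = [site_probabilities(s, pfm, background) for s in seqs]
    return pfm, weights

# Hypothetical toy sequences for illustration.
toy = ["ACGTTGTGACAT", "TTTGTGACGGCA", "GGCATGTGACTT", "CATGTGACAAGT"]
pfm, weights = em_motif(toy, width=5)
print([max(range(len(w)), key=w.__getitem__) for w in weights])
```

The printed positions are the most probable site start in each sequence after the final iteration. As the slide notes, nothing guarantees that this is the optimal alignment; a different starting matrix can lead to a different answer.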

Gibbs Sampling Approach to Motif Discovery

Same Basic Idea:
- Given sites, estimate PWM (log-odds model)
- Given PWM, pick likely sites according to their probability
- Iterate between those steps until convergence

Algorithm:
• Pick 1 site from N-1 sequences, make PWM
• Use “pseudocounts” to avoid prob. = 0
• Use current PWM to pick a site from the left-out sequence by sampling from the probability distribution; update PWM
• Iterate to convergence; run multiple times, compare results; still no guarantee of optimal, but avoids local optima often obtained with EM
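A minimal sketch of the sampling loop described above, under simplifying assumptions that are choices made here: one site per sequence, a uniform background, a fixed number of sweeps rather than a convergence test, and a fixed random seed for reproducibility. The toy sequences and function names are made up.

```python
import math
import random

BASES = "ACGT"

def pfm_from_sites(sites, pseudocount=0.5):
    """Frequency matrix from aligned sites; pseudocounts keep every entry above 0."""
    width = len(sites[0])
    pfm = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pfm.append({b: counts[b] / total for b in BASES})
    return pfm

def window_weights(seq, pfm, background=0.25):
    """Likelihood ratio (motif vs uniform background) for every window in seq."""
    width = len(pfm)
    return [math.prod(pfm[i][seq[s + i]] / background for i in range(width))
            for s in range(len(seq) - width + 1)]

def gibbs_motif(seqs, width, sweeps=200, seed=0):
    rng = random.Random(seed)
    # Start from a random site in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(sweeps):
        for k, seq in enumerate(seqs):
            # Build the matrix from the sites in all sequences except sequence k.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != k]
            pfm = pfm_from_sites(others)
            # Sample a new site for sequence k in proportion to its likelihood ratio.
            weights = window_weights(seq, pfm)
            starts[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical toy sequences for illustration.
toy = ["ACGTTGTGACAT", "TTTGTGACGGCA", "GGCATGTGACTT", "CATGTGACAAGT"]
starts = gibbs_motif(toy, width=5, sweeps=50)
print(starts)
```

Because each site is sampled rather than chosen as the maximum, the chain can leave a poor alignment. Running the function with several seeds and comparing the results corresponds to the "run multiple times" advice on the slide.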


From Lawrence et al. (1993) Science 262:208-14.

A B

Motif discovery from co-regulated genes

Single species Multiple species


Example – Leu3

Alignment of profiles

YGL125W
A . . 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . .
C . . 0 1 1 2 4 2 4 4 4 0 0 0 0 0 4 0 0 0 4 4 0 0 4 0 0 0 0 . .
G . . 1 0 0 0 0 0 0 0 0 4 4 0 0 4 0 4 4 4 0 0 3 0 0 4 0 0 0 . .
T . . 3 3 3 2 0 2 0 0 0 0 0 4 0 0 0 0 0 0 0 0 1 4 0 0 4 4 4 . .

YOR108W
A . . 0 0 4 2 0 0 0 0 0 0 0 0 4 4 0 0 0 1 0 0 0 3 1 0 0 3 4 . .
C . . 0 2 0 1 0 0 0 4 4 0 0 0 0 0 4 0 0 1 4 1 0 1 0 0 1 0 0 . .
G . . 0 0 0 0 4 4 0 0 0 4 4 0 0 0 0 4 4 0 0 2 0 0 3 1 3 1 0 . .
T . . 4 2 0 1 0 0 4 0 0 0 0 4 0 0 0 0 0 2 0 1 4 0 0 3 0 0 0 . .

YMR108W
A . . 0 2 1 1 0 1 0 0 0 0 0 0 4 0 0 0 0 0 0 0 1 2 3 0 0 1 1 . .
C . . 3 0 1 1 4 0 0 4 4 0 0 0 0 4 4 0 0 4 0 0 0 0 1 1 0 3 2 . .
G . . 1 1 2 0 0 0 4 0 0 4 4 0 0 0 0 4 4 0 0 0 3 2 0 0 1 0 1 . .
T . . 0 1 0 2 0 3 0 0 0 0 0 4 0 0 0 0 0 0 4 4 0 0 0 3 3 0 0 . .

YGL125W
S. cerevisiae    GAAAAAATAACAGCGACTTTTCTCCCGGTAGCGGGCCGTCGTTTAGTCATTCTATCCCTC
S. mikatae       AAAACATAACAGCGAATTTTCCTCCCGGTAGCGGGCCTTCGTTTAGTCATTCTCTCTCTT
S. bayanus       AAAAAATAACAGCGACTTTTCCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCCC
S. kudriavzevii  GAAAAAAAACAACGGCGGCCTCCCCCGGTAGCGGGCCGTCGTTTAGTCATTCTCTCTCTC

***** **** ** *** * *************************************

YOR108W
S. cerevisiae    GCCATCATGGTCCGGTAACGGTCGTAGTGAATGACTCATATTTTTCCATCTCTTT
S. mikatae       GCCATCAAGGTCCGGTAACGGTCGTAGTGAATGACTCACATTTTCTTCGTTATTC
S. bayanus       ACCATTACGGTCCGGTAACGGACTTAGTGAATGATTCATCTTTTCTTCTTTTTTC
S. kudriavzevii  GTCGTTAAGGTCCGGTAACGGCCCTCAGCGAATGATTCATAATTTCATTTTTTTC

***** * ************* * ********** *** **** *** ***

YMR108W
S. cerevisiae    AACGCCTAGCCGCCGGAGCCTGCCGGTACCGGCTTGGCTTCAGTTGCTGATCTCGG
S. mikatae       CACAATGACACATACCTAACAGCCGGTACCGGCTTGAATGCCGCCGTTGGCTTCGG
S. bayanus       ATCTTCTAGTCACCGCAGTCTGCCGGTACCGGCTTGAATTCCGCCGTTGATCCTGG
S. kudriavzevii  CACATCTCTAGTCCGCGCTCTGCCGGTACCGGCTTAGACTAGCCACGAATCTCGGC

** *** * **** ***************** **** ** * ** **

Alignment of conserved regions

Wang and Stormo, Bioinformatics 2003

Even whole-genome search for conserved, multi-copy elements (e.g. PhyloNet)

Wang and Stormo, PNAS 2005
