Motif discovery

Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington

[email protected]

Outline

• One-minute response
• Revision
• Motifs
– Gibbs sampling
– Expectation maximization
• Python

One-minute responses

• Are we able to read from the smooth curve that we learned to create today what is a “good enough p-value”?
– No, the curve gives you the p-value, but deciding what is “good enough” requires that you know the costs associated with false positives and false negatives.
• The concepts were clear (for today).
• Keep doing more explanation on the board.
• Can you please explain the last part of converting scores to p-values?
• Continue with revision every day.

Other questions and comments

• We would prefer to know how we did on the first assessment before you give us the second assessment.

Converting scores to p-values

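• Say that your motif has width N (N positions) and that every entry is a score between 0 and 100. Create a matrix that has N rows and 100N + 1 columns, one column for each possible total score from 0 to 100N.

• The entry in row i, column j is the number of different sequences of length i that have a score of j.

Motif (one row per letter, one column per position):

A   10  67  59  44
C   60  39  49  29
G    0  71  50  54
T  100  43  13  64

Columns of the matrix (scores): 0 1 2 3 4 … 400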


• For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix.

• There are only 4 possible sequences of length 1.

[Figure: the motif above, with row 1 of the matrix holding a 1 in columns 0, 10, 60 and 100.]


• For each value x in the second column of your motif, consider each value y in column z of the first row of the matrix.

• Add y to column x+z of the second row of the matrix.

[Figure: the same motif and matrix. Row 1 holds a 1 in columns 0, 10, 60 and 100; the first update puts a 1 in column 77 = 10 + 67 of row 2.]


• What values will go in row 2?
– 10+67, 10+39, 10+71, 10+43, 60+67, …, 100+43
• These 16 values correspond to all 16 strings of length 2.


• In the end, the bottom row contains the scores for all possible sequences of length N.

• Use these scores to compute a p-value.
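The row-by-row procedure above can be sketched in Python. This is a minimal illustration, not the course's reference solution: the names `MOTIF`, `score_counts` and `p_value` are made up for this example, a dictionary stands in for each row of the matrix so that only non-zero columns are stored, and the p-value assumes that every sequence of length N is equally likely.

```python
from collections import Counter

# Motif from the slides: one dict per motif position, mapping letter -> score.
MOTIF = [
    {"A": 10, "C": 60, "G": 0, "T": 100},
    {"A": 67, "C": 39, "G": 71, "T": 43},
    {"A": 59, "C": 49, "G": 50, "T": 13},
    {"A": 44, "C": 29, "G": 54, "T": 64},
]

def score_counts(motif):
    """Return counts[s] = number of sequences of length len(motif) with score s."""
    counts = Counter({0: 1})  # the empty sequence has score 0
    for column in motif:
        next_counts = Counter()
        for z, y in counts.items():      # y sequences already have score z
            for x in column.values():    # extend each of them by one letter
                next_counts[x + z] += y
        counts = next_counts
    return counts

def p_value(counts, observed):
    """Fraction of all sequences scoring at least `observed` (uniform assumption)."""
    at_least = sum(c for s, c in counts.items() if s >= observed)
    return at_least / sum(counts.values())

counts = score_counts(MOTIF)
print(sum(counts.values()), "sequences;", len(counts), "distinct scores")
print("p-value of a score of 250:", p_value(counts, 250))
```

Each pass of the outer loop fills in one row of the matrix from the row above it, so the work grows with the number of distinct scores rather than with the number of sequences.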


Sample problem

• List the scores of all 4 length-1 DNA sequences relative to this motif:

A   10  15 100  80
C  100  30   7  21
G    0  30  22  14
T   10  35   9  51

– A=10, C=100, G=0, T=10
• List the scores of all 16 length-2 DNA sequences relative to the same motif.
– AA=25, AC=40, AG=40, AT=45, CA=115, CC=130, CG=130, CT=135, GA=15, GC=30, GG=30, GT=35, TA=25, TC=40, TG=40, TT=45
• Draw the dynamic programming matrix for this motif and indicate where the 20 scores you computed would go.
– Sorted by score: GA=15, AA=25, TA=25, GC=30, GG=30, GT=35, AC=40, AG=40, TC=40, TG=40, AT=45, TT=45, CA=115, CC=130, CG=130, CT=135
– Row 1: 0 x1, 10 x2, 100 x1
– Row 2: 15 x1, 25 x2, 30 x2, 35 x1, 40 x4, 45 x2, 115 x1, 130 x2, 135 x1
• How many distinct scores do you observe for sequences of length 2?
– 9
• How many calculations will you need to perform to compute scores for sequences of length 3?
– 9 x 4 = 36
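For a motif this small, the scores can also be checked by brute force. The sketch below enumerates every length-2 sequence and tallies its score directly from the first two columns of the sample motif. The names `SAMPLE`, `score`, `length2` and `row2` are illustrative only.

```python
from collections import Counter
from itertools import product

# Sample-problem motif: letter -> scores at positions 1..4.
SAMPLE = {
    "A": [10, 15, 100, 80],
    "C": [100, 30, 7, 21],
    "G": [0, 30, 22, 14],
    "T": [10, 35, 9, 51],
}

def score(seq, motif=SAMPLE):
    """Sum the motif entries selected by each letter of seq, left to right."""
    return sum(motif[letter][i] for i, letter in enumerate(seq))

# Score of every length-2 sequence, and how many sequences share each score.
length2 = {"".join(p): score(p) for p in product("ACGT", repeat=2)}
row2 = Counter(length2.values())

for s in sorted(row2):
    print(s, "x", row2[s])
print(len(row2), "distinct scores")
```

Enumeration costs 4^N score computations, which is why the dynamic programming matrix is preferred for realistic motif widths.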

Revision

• How many distinct amino acid sequences of length 10 exist?
– 20^10 = 1.024 x 10^13

• Say that you use dynamic programming to compute the distribution of scores for a motif of width 10. You then observe a sequence with a score of 28. Describe how you would compute the p-value of 28 from the output of the dynamic programming.
– Compute the sum S of the counts for scores ≥ 28.
– The p-value is S / 20^10.
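Written out as a worked equation, with count(s) used here as an assumed name for the bottom-row entry of the dynamic programming matrix at score s:

```latex
\[
20^{10} = 2^{10}\cdot 10^{10} = 1024 \times 10^{10} = 1.024 \times 10^{13},
\qquad
p(28) = \frac{S}{20^{10}},
\quad
S = \sum_{s \ge 28} \mathrm{count}(s).
\]
```

The denominator counts every amino acid sequence of length 10 exactly once, so the ratio is the fraction of sequences scoring at least 28.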

Motif discovery problem

• Given sequences

seq. 1
seq. 2
seq. 3

• Find motif

seq. 1: IGRGGFGEVY at position 515
seq. 2: LGEGCFGQVV at position 430
seq. 3: VGSGGFGQVY at position 682

Motif discovery problem

• Given:
– a sequence or family of sequences.
• Find:
– the number of motifs
– the width of each motif
– the locations of motif occurrences

Why is this hard?

• Input sequences are long (thousands or millions of residues).

• Motif may be subtle
– Instances are short.
– Instances are only slightly similar.

Globin motifs

        xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU    V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR    M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK    V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU    VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR    VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK    VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU    G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR    G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB   M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI  A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL    GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB   M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

        xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU    QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR    QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK    QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU    AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR    AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK    AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU    EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR    EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB   T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI  NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL    NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB   Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

        xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU    VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR    VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK    VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU    VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR    IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK    IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU    QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR    HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB   TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI  ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL    AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB   EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

A Concrete Example: Transcription Factor Binding Sites

• Transcription factor proteins bind to DNA and regulate gene expression.

• The promoter is a region near the start of the gene where transcription factors bind.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3


We are given a set of promoters from co-regulated genes.


An unknown transcription factor binds to positions unknown to us, on either DNA strand.










The DNA binding motif of the transcription factor can be described by a position-specific scoring matrix (PSSM).










The sequence motif discovery problem is to discover the sites (or the motif) given just the sequences.

Gibbs sampling

Alternating approach

1. Guess an initial weight matrix
2. Use weight matrix to predict instances in the input sequences
3. Use instances to predict a weight matrix
4. Repeat 2 & 3 until satisfied.

Initialization

• Randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.

sequence 1: ACAGTGT
sequence 2: TTAGACC
sequence 3: GTGACCA
sequence 4: ACCCAGG
sequence 5: CAGGTTT

Gibbs sampler

• Initially: randomly guess an instance s_i from each of t input sequences {S_1, ..., S_t}.

• Steps 2 & 3 (search):
– Throw away an instance s_i: the remaining (t - 1) instances define the weight matrix.
– The weight matrix defines the instance probability at each position of input string S_i.
– Pick a new s_i according to the probability distribution.

• Return the highest-scoring motif seen.
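The loop described above can be sketched as follows. This is one possible reading of the slide, not a definitive implementation: the pseudocount, the round-robin choice of which instance to throw away, and the log-likelihood used as the alignment score are all assumptions of this example, as are the function names.

```python
import math
import random

ALPHABET = "ACGT"

def weight_matrix(instances, pseudocount=0.25):
    """Per-position letter frequencies, smoothed with an assumed pseudocount."""
    total = len(instances) + pseudocount * len(ALPHABET)
    return [
        {a: (sum(inst[j] == a for inst in instances) + pseudocount) / total
         for a in ALPHABET}
        for j in range(len(instances[0]))
    ]

def window_probability(matrix, window):
    """Probability of a window under the weight matrix."""
    return math.prod(col[letter] for col, letter in zip(matrix, window))

def alignment_score(instances):
    """Assumed score: log-likelihood of the instances under their own matrix."""
    matrix = weight_matrix(instances)
    return sum(math.log(window_probability(matrix, inst)) for inst in instances)

def gibbs_sampler(sequences, width, iterations=1000, seed=0):
    rng = random.Random(seed)
    # Initially: randomly guess one instance (start position) per sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    best_score, best_starts = -math.inf, list(starts)
    for step in range(iterations):
        i = step % len(sequences)  # throw away the instance from sequence i
        others = [s[p:p + width]
                  for k, (s, p) in enumerate(zip(sequences, starts)) if k != i]
        matrix = weight_matrix(others)
        seq = sequences[i]
        weights = [window_probability(matrix, seq[p:p + width])
                   for p in range(len(seq) - width + 1)]
        # Pick the new instance according to the probability distribution.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
        current = alignment_score([s[p:p + width] for s, p in zip(sequences, starts)])
        if current > best_score:
            best_score, best_starts = current, list(starts)
    return best_starts, best_score
```

Because the new position is sampled rather than chosen greedily, the sampler can move away from a poor alignment; keeping the best alignment seen so far is what lets it return a result whenever it is stopped.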

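Sampler step illustration:

Current instances (the instance from sequence 4 has been thrown away):

sequence 1: ACAGTGT
sequence 2: TAGGCGT
sequence 3: ACACCGT
sequence 4: ???????
sequence 5: CAGGTTT

Weight matrix defined by the remaining instances:

A  .45 .45 .45 .05 .05 .05 .05
C  .25 .45 .05 .25 .45 .05 .05
G  .05 .05 .45 .65 .05 .65 .05
T  .25 .05 .05 .05 .45 .25 .85

Candidate instances in sequence 4: ACGCCGT: 20%, ACGGCGT: 52%, a third position: 11%

New instances after sampling:

sequence 1: ACAGTGT
sequence 2: TAGGCGT
sequence 3: ACACCGT
sequence 4: ACGCCGT
sequence 5: CAGGTTT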
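The weight matrix in this illustration can be reproduced from the four remaining instances. The slide does not say how the matrix was smoothed; a pseudocount of 0.25 per letter is an assumption that happens to match every entry shown. The function names are illustrative.

```python
import math

ALPHABET = "ACGT"
# Instances from sequences 1, 2, 3 and 5; sequence 4's instance was thrown away.
remaining = ["ACAGTGT", "TAGGCGT", "ACACCGT", "CAGGTTT"]

def weight_matrix(instances, pseudocount=0.25):
    """Letter -> list of smoothed frequencies, one per motif position."""
    total = len(instances) + pseudocount * len(ALPHABET)
    return {
        a: [(sum(inst[j] == a for inst in instances) + pseudocount) / total
            for j in range(len(instances[0]))]
        for a in ALPHABET
    }

def window_probability(matrix, window):
    """Product of the matrix entries selected by each letter of the window."""
    return math.prod(matrix[letter][j] for j, letter in enumerate(window))

matrix = weight_matrix(remaining)
for a in ALPHABET:
    print(a, " ".join(f"{p:.2f}" for p in matrix[a]))

# Relative weight of the two labelled candidates in sequence 4.
ratio = window_probability(matrix, "ACGGCGT") / window_probability(matrix, "ACGCCGT")
print(f"ACGGCGT is {ratio:.1f} times as likely as ACGCCGT")
```

The two candidates differ only at position 4 (G versus C), so their ratio is .65 / .25 = 2.6, which agrees with the 52% and 20% shown on the slide.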









The MEME Algorithm

MEME uses expectation maximization (EM) to discover sequence motifs.

Slides courtesy of Tim Bailey









The MEME Algorithm

The positions (and strands) of the sites are the missing information variables, Z = {Zi,j}.









The MEME Algorithm

If we knew the Z = {Zi,j}, we could estimate the motif PSSM using maximum likelihood.









The MEME Algorithm

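Alignment (rows i = 1, 2, …, N; columns j = 1, 2, …, w):

1  AAAAGAGTCA
2  AAATGACTCA
   AAGTGAGTCA
   AAAAGAGTCA
   GGATGAGTCA
N  AAATGAGTCA

[Figure: the alignment is converted to a Count Matrix and then to a Frequency Matrix, each with rows A, C, G, T and columns 1, 2, …, w.]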
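A minimal sketch of the step from alignment to count matrix to frequency matrix, using the six rows visible on the slide as if they were the whole alignment. The function names are assumptions of this example, and no pseudocounts are applied, so the frequencies are the plain maximum likelihood estimates.

```python
ALPHABET = "ACGT"

# The six aligned sites shown on the slide (rows 1, 2, ..., N).
alignment = [
    "AAAAGAGTCA",
    "AAATGACTCA",
    "AAGTGAGTCA",
    "AAAAGAGTCA",
    "GGATGAGTCA",
    "AAATGAGTCA",
]

def count_matrix(sites):
    """counts[a][j] = number of sites with letter a in column j."""
    width = len(sites[0])
    return {a: [sum(site[j] == a for site in sites) for j in range(width)]
            for a in ALPHABET}

def frequency_matrix(counts):
    """Divide every count by the number of sites."""
    n = sum(counts[a][0] for a in ALPHABET)
    return {a: [c / n for c in counts[a]] for a in ALPHABET}

counts = count_matrix(alignment)
freqs = frequency_matrix(counts)
for a in ALPHABET:
    print(a, counts[a], [round(f, 2) for f in freqs[a]])
```

In practice a small pseudocount is usually added to every count so that a letter never seen in a column does not get probability zero.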









The MEME Algorithm

MEME starts Expectation Maximization from an initial estimate of θ based on a subsequence in the data.

θ(1)









The MEME Algorithm

EM maximizes the expected joint likelihood of the sequences (X) and missing information (Z) under a probabilistic model.


The MEME Algorithm

The current estimates of the parameters of the data model are φ(t), and include θ(t).

The MEME Algorithm

EM iteratively refines φ(t).

The MEME Algorithm

E-step: Use φ(t) to estimate the probabilities of each position in the data being a site.

The MEME Algorithm

M-step: Use Z(t) to find the value of φ that maximizes the expectation of the joint conditional probability.

Expectation-maximization

foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score
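The pseudocode above can be turned into a small runnable sketch. It is not MEME itself: it assumes exactly one motif occurrence per sequence, a single strand, a uniform background, a fixed pseudocount, and a simple likelihood-ratio score for choosing among starting points. All function names and default values are assumptions of this example.

```python
import math

ALPHABET = "ACGT"
BACKGROUND = 0.25  # assumption: uniform background letter probability

def seed_matrix(subsequence, high=0.7):
    """Convert a subsequence to a matrix: its letter gets `high`, the rest share the remainder."""
    low = (1.0 - high) / (len(ALPHABET) - 1)
    return [{a: high if a == letter else low for a in ALPHABET}
            for letter in subsequence]

def site_ratios(sequences, matrix):
    """Motif-versus-background likelihood ratio of every window of every sequence."""
    w = len(matrix)
    return [[math.prod(matrix[j][seq[p + j]] / BACKGROUND for j in range(w))
             for p in range(len(seq) - w + 1)] for seq in sequences]

def e_step(sequences, matrix):
    """Z[i][p]: probability that the site in sequence i starts at position p."""
    Z = []
    for row in site_ratios(sequences, matrix):
        total = sum(row)
        Z.append([r / total for r in row])
    return Z

def m_step(sequences, Z, width, pseudocount=0.1):
    """Re-estimate the matrix from windows weighted by their site probabilities."""
    counts = [{a: pseudocount for a in ALPHABET} for _ in range(width)]
    for seq, z in zip(sequences, Z):
        for p, weight in enumerate(z):
            for j in range(width):
                counts[j][seq[p + j]] += weight
    return [{a: col[a] / sum(col.values()) for a in ALPHABET} for col in counts]

def em(sequences, start, max_iter=100, tol=1e-6):
    """Alternate the two re-estimation steps until the matrix stops changing."""
    matrix = seed_matrix(start)
    for _ in range(max_iter):
        Z = e_step(sequences, matrix)
        updated = m_step(sequences, Z, len(matrix))
        change = max(abs(updated[j][a] - matrix[j][a])
                     for j in range(len(matrix)) for a in ALPHABET)
        matrix = updated
        if change < tol:
            break
    return matrix, Z

def score(sequences, matrix):
    """Assumed score: sum over sequences of the log average window ratio."""
    return sum(math.log(sum(row) / len(row)) for row in site_ratios(sequences, matrix))

def discover(sequences, width):
    """Run EM from every distinct subsequence of the given width; keep the best matrix."""
    starts = {seq[p:p + width] for seq in sequences
              for p in range(len(seq) - width + 1)}
    return max((em(sequences, s)[0] for s in sorted(starts)),
               key=lambda m: score(sequences, m))
```

Unlike the Gibbs sampler, every step here is deterministic: the E-step assigns fractional weight to every window instead of sampling one, so the result depends only on the starting subsequence.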

EM

Comparison

• Both EM and Gibbs sampling involve iterating over two steps.

• Convergence:
– EM converges when the PSSM stops changing.
– Gibbs sampling runs until you ask it to stop.

• Solution:
– EM may not find the motif with the highest score.
– Gibbs sampling will provably find the motif with the highest score, if you let it run long enough.

Sample problem #1

• You are implementing a Gibbs sampler for motif discovery. Your code has just scanned a sequence S with a motif PSSM to produce a list of scores x_1, …, x_n. You will write the code to randomly select the new motif occurrence on the basis of the score list.

• Given:
– a list of scores x_1, …, x_n

• Return:
– Print to standard output an integer i in the range 1, …, n and the corresponding score x_i, where i is selected with probability proportional to x_i.

> ./randomly-select-4.py scores.txt
Read 10000 scores from scores.txt.
Sum of scores = 20042.7
Random value = 3298.62
1655 1.41589
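One way to make the selection is to draw a random value between 0 and the sum of the scores and walk along the list until the running total passes it, which matches the "Sum of scores" and "Random value" lines in the sample output. The sketch below shows only that core step; the function name `select_index` is hypothetical, file reading and printing are left out, and the scores are assumed to be non-negative with a positive sum.

```python
import random

def select_index(scores, rng=random):
    """Return (i, x_i) with 1-based i chosen with probability proportional to x_i."""
    total = sum(scores)
    threshold = rng.random() * total  # uniform in [0, total)
    running = 0.0
    for i, x in enumerate(scores, start=1):
        running += x
        if running > threshold:
            return i, x
    # Floating-point round-off can leave the running total just below the threshold.
    return len(scores), scores[-1]

rng = random.Random(0)
i, x = select_index([0.5, 2.0, 1.5], rng)
print(i, x)
```

The standard library call `random.choices(range(1, n + 1), weights=scores)` makes the same kind of draw in one line; the explicit loop is shown because it exposes the sum and the random value.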

Sample problem #2

• Given:
– a list of DNA sequences,
– the motif width,
– the integer index of the (currently estimated) motif occurrence within each sequence,
– the index j of the sequence for which to update the motif occurrence, and
– the list of scores for sequence j.

• Return:
– a new alignment with the motif occurrence for sequence j randomly replaced.

> ./do-gibbs-step.py crp0.txt 10 crp0-offsets.txt 1 crp0-scores.txt > foo
Read 18 sequences from crp0.txt.
Read 95 values from crp0-scores.txt.
Read 18 values from crp0-indices.txt.
Sum of scores = 189.341
Random value = 69.6017
Selected 36 with score 2.81402.
> cat foo
TTTGTGGCAT
CAGAAAAGTC
ACTGTGAGCA
TATGCAAAGG
GGTGTTAAAT
ATTGTGATGT
AATTTATTCC
AACGTGATCA
AATGTGAGTT
TCTGTAACAG
TCTGTGAACT
ATTGTGACAC
TGCCTGACGG
ATTGTGATTC
GTTGTGATGT
GGTGTGAAAT
AATGAGACGT
TTTGTGAGTG
