motif discovery

42
Motif discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington [email protected]

Upload: malo

Post on 24-Feb-2016

67 views

Category:

Documents


0 download

DESCRIPTION

Motif discovery. Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington [email protected]. Outline. One-minute response Revision Motifs Gibbs sampling Expectation maximization Python. One-minute responses. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Motif  discovery

Motif discovery

Prof. William Stafford NobleDepartment of Genome Sciences

Department of Computer Science and EngineeringUniversity of Washington

[email protected]

Page 2: Motif  discovery

Outline

• One-minute response• Revision• Motifs– Gibbs sampling– Expectation maximization

• Python

Page 3: Motif  discovery

One-minute responses• Are we able to read from the smooth curve that we

learned to create today what is a “good enough p-value”?– No, the curve gives you the p-value, but deciding what is “good

enough” requires that you know the costs associated with false positives and false negatives.

• The concepts were clear (for today).• Keep doing more explanation on the board.• Can you please explain the last part of converting scores

to p-values?• Continue with revision every day.

Page 4: Motif  discovery

Other questions and comments

• We would prefer to know how we did on the first assessment before you give us the second assessment.

Page 5: Motif  discovery

Converting scores to p-values

• Say that your motif has N rows. Create a matrix that has N rows and 100N columns.

• The entry in row i, column j is the number of different sequences of length i that can have a score of j.

A 10 67 59 44

C 60 39 49 29

G 0 71 50 54

T 100 43 13 64

0 1 2 3 4 … 400

Page 6: Motif  discovery

Converting scores to p-values

• For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix.

• There are only 4 possible sequences of length 1.

A 10 67 59 44

C 60 39 49 29

G 0 71 50 54

T 100 43 13 64

0 1 2 3 4 … 10 60 100 400

1 11 1

Page 7: Motif  discovery

Converting scores to p-values

• For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix.

• Add y to the x+zth column of the matrix.

A 10 67 59 44

C 60 39 49 29

G 0 71 50 54

T 100 43 13 64

1

1

0 1 2 3 4 … 10 60 77 100 400

11 1

Page 8: Motif  discovery

Converting scores to p-values

• For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix.

• Add y to the x+zth column of the matrix.• What values will go in row 2?

– 10+67, 10+39, 10+71, 10+43, 60+67, …, 100+43• These 16 values correspond to all 16 strings of length 2.

A 10 67 59 44

C 60 39 49 29

G 0 71 50 54

T 100 43 13 64

1

1

0 1 2 3 4 … 10 60 77 100 400

11 1

Page 9: Motif  discovery

Converting scores to p-values

• In the end, the bottom row contains the scores for all possible sequences of length N.

• Use these scores to compute a p-value.

A 10 67 59 44

C 60 39 49 29

G 0 71 50 54

T 100 43 13 64

1

1

0 1 2 3 4 … 10 60 77 100 400

11 1

Page 10: Motif  discovery

Sample problem• List the scores of all 4 length-1 DNA sequences relative to this motif:

A 10 15 100 80C 100 30 7 21G 0 30 22 14T 10 35 9 51

– A=10, C=100, G=0, T=10• List the scores of all 16 length-2 DNA sequences relative to the same

motif.– AA=15, AC=40, AG=40, AT=45, CA=115, CC=130, CG=130, CT=135, GA=15,

GC=30, GG=30, GT=35, TA=25, TC=40, TG=40, TT=45 • Draw the dynamic programming matrix for this motif and indicate

where the 20 scores you computed would go.• 15x2, 25, 30x2, 40x4

Page 11: Motif  discovery

Sample problem

• Draw the dynamic programming matrix for this motif and indicate where the 20 scores you computed would go.– AA=15, GA=15, TA=25, GC=30, GG=30, GT=35, AC=40, AG=40, TC=40,

TG=40, AT=45, TT=45, CA=115, CC=130, CG=130, CT=135• How many distinct scores do you observe for sequences of length 2?

– 9• How many calculations will you need to perform to compute scores

for sequences of length 3?– 9 x 4 = 36

0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 125 130 1351 2 1 2 1 2 1 4 1 1 2 1

Page 12: Motif  discovery

Revision

• How many distinct amino acid sequences of length 10 exist?– 2010=1.024 x 1012

• Say that you use dynamic programming to compute the distribution of scores for a motif of width 10. You then observe a sequence with a score of 28. Describe how you would compute the p-value of 28 from the output of the dynamic programming.– Compute the sum S of the counts for scores ≥28. – The p-value is S/2010.

Page 13: Motif  discovery

Motif discovery problem

• Given sequences

• Find motif

IGRGGFGEVY at position 515LGEGCFGQVV at position 430VGSGGFGQVY at position 682

seq. 1seq. 2seq. 3

seq. 1seq. 2seq. 3

Page 14: Motif  discovery

Motif discovery problem

• Given:– a sequence or family of sequences.

• Find:– the number of motifs– the width of each motif– the locations of motif occurrences

Page 15: Motif  discovery

Why is this hard?

• Input sequences are long (thousands or millions of residues).

• Motif may be subtle– Instances are short.– Instances are only slightly

similar.?

?

Page 16: Motif  discovery

Globin motifs

xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxxHAHU V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSAHAOR M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSAHADK V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSAHBHU VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPDHBOR VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAGHBDK VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPTMYHU G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSEDMYOR G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTEDIGLOB M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKDGPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPEGPYL GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQGGZLB M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE  xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxxHAHU QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLLHAOR QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCILHADK QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFLHBHU AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVLHBOR AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVLHBDK AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDILMYHU EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECIIMYOR EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAIIIGLOB T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFRGPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKGGPYL NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKEGGZLB Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK  xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..xHAHU VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..RHAOR VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..RHADK VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..RHBHU VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..HHBOR IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..HHBDK IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..HMYHU QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..GMYOR HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..GIGLOB TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..LGPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..EGPYL AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaAGGZLB EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

Page 17: Motif  discovery

A Concrete Example: Transcription Factor Binding Sites

• Transcription factor proteins bind to DNA and regulate gene expression.

• The promoter is a region near the start of the gene where transcription factors bind.

Page 18: Motif  discovery

TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

A Concrete Example: Transcription Factor Binding Sites

We are given a set of promoters from co-regulated genes.

Page 19: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

A Concrete Example: Transcription Factor Binding Sites

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

An unknown transcription factor binds to positions unknown to us, on either DNA strand.

Page 20: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

A Concrete Example: Transcription Factor Binding Sites

The DNA binding motif of the transcription factor can be described by a position-specific scoring matrix (PSSM).

Page 21: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

A Concrete Example: Transcription Factor Binding Sites

The sequence motif discovery problem is to discover the sites (or the motif) given just the sequences.

Page 22: Motif  discovery

Gibbs sampling

Page 23: Motif  discovery

Alternating approach

1. Guess an initial weight matrix2. Use weight matrix to predict instances in the

input sequences3. Use instances to predict a weight matrix4. Repeat 2 & 3 until satisfied.

Page 24: Motif  discovery

Initialization

• Randomly guess an instance si from each of t input sequences {S1, ..., St}.

sequence 1

sequence 2

sequence 3

sequence 4

sequence 5

ACAGTGTTTAGACCGTGACCAACCCAGGCAGGTTT

Page 25: Motif  discovery

Gibbs sampler

• Initially: randomly guess an instance si from each of t input sequences {S1, ..., St}.

• Steps 2 & 3 (search):– Throw away an instance si: remaining (t - 1) instances

define weight matrix.– Weight matrix defines instance probability at each position

of input string Si

– Pick new si according to probability distribution

• Return highest-scoring motif seen

Page 26: Motif  discovery

Sampler step illustration:

ACAGTGTTAGGCGTACACCGT???????CAGGTTT

ACGT

.45 .45 .45 .05 .05 .05 .05

.25 .45 .05 .25 .45 .05 .05

.05 .05 .45 .65 .05 .65 .05

.25 .05 .05 .05 .45 .25 .85

ACGCCGT:20% ACGGCGT:52%

ACAGTGTTAGGCGTACACCGTACGCCGTCAGGTTT

sequence 411%

Page 27: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

The MEME Algorithm

MEME uses expectation maximization (EM) to discover sequence motifs.

Slides courtesy of Tim Bailey

Page 28: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

The MEME Algorithm

The positions (and strands) of the sites are the missing information variables, Z = {Zi,j}.

Page 29: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

The MEME Algorithm

If we knew the Z = {Zi,j}, we could estimate the motif PSSM using maximum likelihood.

Page 30: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

The MEME Algorithm

Alignment1 AAAAGAGTCA2 AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCAN AAATGAGTCA12 … w

i

j

Frequency MatrixCount Matrix

ACGT

ACGT

12 … w 12 … w

Page 31: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

The MEME Algorithm

MEME starts Expectation Maximization from an initial estimate of θ based on a subsequence in the data.

θ(1)

Page 32: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

The MEME Algorithm

EM maximizes the expected joint likelihood of the sequences (X) and missing information (Z) under a probabilistic model.

θ(1)

Page 33: Motif  discovery

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT

5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG

5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT

5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC

5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA

5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA

5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

…HIS7

…ARO4

…ILV6

…THR4

…ARO1

…HOM2

…PRO3

The MEME Algorithm

The current estimates of the parameters of the data model are φ(t), and include θ(t).

θ(1)

Page 34: Motif  discovery

The MEME Algorithm

EM iteratively refines φ(t).

Page 35: Motif  discovery

The MEME Algorithm

E-step: Use φ(t) to estimate the probabilities of each position in the data being a site.

Page 36: Motif  discovery

The MEME Algorithm

M-step: Use Z(t) to find the value of φ that maximizes the expectation of the joint conditional probability.

Page 37: Motif  discovery

Alternating approach

1. Guess an initial weight matrix2. Use weight matrix to predict instances in the

input sequences3. Use instances to predict a weight matrix4. Repeat 2 & 3 until satisfied.

Page 38: Motif  discovery

Expectation-maximization

foreach subsequence of width Wconvert subsequence to a matrixdo {

re-estimate motif occurrences from matrixre-estimate matrix model from motif occurrences

} until (matrix model stops changing)endselect matrix with highest score

EM

Page 39: Motif  discovery

Comparison

• Both EM and Gibbs sampling involve iterating over two steps

• Convergence:– EM converges when the PSSM stops changing.– Gibbs sampling runs until you ask it to stop.

• Solution:– EM may not find the motif with the highest score.– Gibbs sampling will provably find the motif with the

highest score, if you let it run long enough.

Page 40: Motif  discovery

Sample problem #1• You are implementing a Gibbs sampler for motif discovery. Your code has

just scanned a sequence S with a motif PSSM to produce a list of scores x1, …, xn. You will write the code to randomly select the new motif occurrence on the basis of the score list.

• Given:– a list of scores x1, …, xn

• Return:– Print to standard output an integer i in the range 1, …, n and the corresponding

score xi, where i is selected with probability proportional to xi.

> ./randomly-select-4.py scores.txt Read 10000 scores from scores.txt.Sum of scores = 20042.7Random value = 3298.621655 1.41589

Page 41: Motif  discovery

Sample problem #2

• Given:– a list of DNA sequences,– the motif width,– the integer index of the (currently estimated) motif

occurrence within each sequence,– the index j of the sequence for which to update the motif

occurrence, and– the list of scores for sequence j.

• Return:– a new alignment with the motif occurrence for sequence j

randomly replaced.

Page 42: Motif  discovery

>./do-gibbs-step.py crp0.txt 10 crp0-offsets.txt 1 crp0-scores.txt > fooRead 18 sequences from crp0.txt.Read 95 values from crp0-scores.txt.Read 18 values from crp0-indices.txt.Sum of scores = 189.341Random value = 69.6017Selected 36 with score 2.81402.> cat fooTTTGTGGCATCAGAAAAGTCACTGTGAGCATATGCAAAGGGGTGTTAAATATTGTGATGTAATTTATTCCAACGTGATCAAATGTGAGTTTCTGTAACAGTCTGTGAACTATTGTGACACTGCCTGACGGATTGTGATTCGTTGTGATGTGGTGTGAAATAATGAGACGTTTTGTGAGTG