motif discovery


Post on 24-Feb-2016


DESCRIPTION

Motif discovery. Prof. William Stafford Noble, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington. thabangh@gmail.com. Outline: One-minute response, Revision, Motifs, Gibbs sampling, Expectation maximization, Python.

TRANSCRIPT

Multiple testing correction

Motif discovery
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington

thabangh@gmail.com

Outline
One-minute response
Revision
Motifs
Gibbs sampling
Expectation maximization
Python

One-minute responses
Are we able to read from the smooth curve that we learned to create today what is a good enough p-value?
No, the curve gives you the p-value, but deciding what is good enough requires that you know the costs associated with false positives and false negatives.
The concepts were clear (for today).
Keep doing more explanation on the board.
Can you please explain the last part of converting scores to p-values?
Continue with revision every day.

Other questions and comments
We would prefer to know how we did on the first assessment before you give us the second assessment.

Converting scores to p-values
Say that your motif has N positions (the columns of the motif matrix below). Create a matrix that has N rows and about 100N columns, one column for each possible total score up to 100N.
The entry in row i, column j is the number of different sequences of length i that have a score of j.

  A   10  67  59  44
  C   60  39  49  29
  G    0  71  50  54
  T  100  43  13  64

Matrix columns (scores): 0 1 2 3 4 ... 400

For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix.
There are only 4 possible sequences of length 1, so row 1 has a 1 in columns 0, 10, 60 and 100.

For each value x in the second column of your motif, consider each value y in column z of the first row of the matrix.
Add y to column x+z of the second row of the matrix.
What values will go in row 2?
10+67, 10+39, 10+71, 10+43, 60+67, ..., 100+43
These 16 values correspond to all 16 strings of length 2.

In the end, the bottom row contains the counts of the scores of all possible sequences of length N.
Use these counts to compute a p-value.

Sample problem
List the scores of all 4 length-1 DNA sequences relative to this motif:

  A   10  15  100  80
  C  100  30    7  21
  G    0  30   22  14
  T   10  35    9  51

A=10, C=100, G=0, T=10

List the scores of all 16 length-2 DNA sequences relative to the same motif.
AA=25, AC=40, AG=40, AT=45, CA=115, CC=130, CG=130, CT=135, GA=15, GC=30, GG=30, GT=35, TA=25, TC=40, TG=40, TT=45

Draw the dynamic programming matrix for this motif and indicate where the 20 scores you computed would go.
Sorted by score: GA=15, AA=25, TA=25, GC=30, GG=30, GT=35, AC=40, AG=40, TC=40, TG=40, AT=45, TT=45, CA=115, CC=130, CG=130, CT=135
The matrix columns run from 0 to 135 in steps of 5. The nonzero entries are:
Row 1 (length 1): score 0 x1, score 10 x2, score 100 x1
Row 2 (length 2): score 15 x1, score 25 x2, score 30 x2, score 35 x1, score 40 x4, score 45 x2, score 115 x1, score 130 x2, score 135 x1

How many distinct scores do you observe for sequences of length 2?
9
How many calculations will you need to perform to compute scores for sequences of length 3?
9 x 4 = 36

Revision
How many distinct amino acid sequences of length 10 exist?
20^10 = 1.024 x 10^13
Say that you use dynamic programming to compute the distribution of scores for a motif of width 10. You then observe a sequence with a score of 28. Describe how you would compute the p-value of 28 from the output of the dynamic programming.
Compute the sum S of the counts for scores >= 28. The p-value is S/20^10.
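The dynamic programming described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture: the function names `score_distribution` and `p_value` are made up here, each matrix row is stored as a sparse `Counter` (score to count) rather than as a dense row of about 100N columns, and the motif is assumed to be given as a list of positions, each holding the four scores for A, C, G, T.

```python
from collections import Counter

def score_distribution(motif):
    """Return one Counter per motif position.

    rows[i] maps a score to the number of sequences of length i + 1
    that achieve that score (row i + 1 of the matrix on the slides).
    """
    rows = []
    previous = Counter({0: 1})  # the empty sequence has score 0
    for column in motif:        # column holds the scores for A, C, G, T
        current = Counter()
        for z, y in previous.items():   # y sequences already have score z
            for x in column:            # extend each of them by one letter
                current[x + z] += y
        rows.append(current)
        previous = current
    return rows

def p_value(final_row, observed):
    """Fraction of all sequences whose score is at least `observed`."""
    total = sum(final_row.values())
    tail = sum(count for score, count in final_row.items() if score >= observed)
    return tail / total

# Sample-problem motif, one tuple per position in the order (A, C, G, T).
sample_motif = [(10, 100, 0, 10), (15, 30, 30, 35), (100, 7, 22, 9), (80, 21, 14, 51)]
rows = score_distribution(sample_motif)
print(sorted(rows[1].items()))
```

The inner loops run once per distinct score in the previous row times four letters, which is where the 9 x 4 = 36 calculations for length 3 come from. For a protein motif the same sketch would apply with 20 scores per position, so the total count in the last row would be 20^N.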

Motif discovery problem
Given sequences: seq. 1, seq. 2, seq. 3

Find motif:
seq. 1: IGRGGFGEVY at position 515
seq. 2: LGEGCFGQVV at position 430
seq. 3: VGSGGFGQVY at position 682

Motif discovery problem
Given:
a sequence or family of sequences.
Find:
the number of motifs
the width of each motif
the locations of motif occurrences
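Once the locations of the occurrences are known, the motif itself can be summarized by tallying the residues at each position. The sketch below is illustrative only: the helper name `column_counts` is hypothetical, and it simply counts letters per column for the three occurrences listed above, without pseudocounts or background correction.

```python
from collections import Counter

# The three motif occurrences from the example above.
instances = ["IGRGGFGEVY", "LGEGCFGQVV", "VGSGGFGQVY"]

def column_counts(occurrences):
    """Return one Counter per motif position, counting residues in that column."""
    width = len(occurrences[0])
    if any(len(site) != width for site in occurrences):
        raise ValueError("all occurrences must have the same width")
    return [Counter(site[i] for site in occurrences) for i in range(width)]

counts = column_counts(instances)
for position, column in enumerate(counts, start=1):
    print(position, dict(column))
```

Columns where all three occurrences agree (for example the G in position 2) show up as a single residue with count 3, while variable columns spread their counts over several residues.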

Why is this hard?
Input sequences are long (thousands or millions of residues).
The motif may be subtle:
Instances are short.
Instances are only slightly similar.

Globin motifs

       xxxxxxxxxxx.xxxxxxxxx.xxxxx..........xxxxxx.xxxxxxx.xxxxxxxxxx.xxxxxxxxx
HAHU   V.LSPADKTN..VKAAWGKVG.AHAGE..........YGAEAL.ERMFLSF..PTTKTYFPH.FDLS.HGSA
HAOR   M.LTDAEKKE..VTALWGKAA.GHGEE..........YGAEAL.ERLFQAF..PTTKTYFSH.FDLS.HGSA
HADK   V.LSAADKTN..VKGVFSKIG.GHAEE..........YGAETL.ERMFIAY..PQTKTYFPH.FDLS.HGSA
HBHU   VHLTPEEKSA..VTALWGKVN.VDEVG...........G.EAL.GRLLVVY..PWTQRFFES.FGDL.STPD
HBOR   VHLSGGEKSA..VTNLWGKVN.INELG...........G.EAL.GRLLVVY..PWTQRFFEA.FGDL.SSAG
HBDK   VHWTAEEKQL..ITGLWGKVNvAD.CG...........A.EAL.ARLLIVY..PWTQRFFAS.FGNL.SSPT
MYHU   G.LSDGEWQL..VLNVWGKVE.ADIPG..........HGQEVL.IRLFKGH..PETLEKFDK.FKHL.KSED
MYOR   G.LSDGEWQL..VLKVWGKVE.GDLPG..........HGQEVL.IRLFKTH..PETLEKFDK.FKGL.KTED
IGLOB  M.KFFAVLALCiVGAIASPLT.ADEASlvqsswkavsHNEVEIlAAVFAAY.PDIQNKFSQFaGKDLASIKD
GPUGNI A.LTEKQEAL..LKQSWEVLK.QNIPA..........HS.LRL.FALIIEA.APESKYVFSF.LKDSNEIPE
GPYL   GVLTDVQVAL..VKSSFEEFN.ANIPK...........N.THR.FFTLVLEiAPGAKDLFSF.LKGSSEVPQ
GGZLB  M.L.DQQTIN..IIKATVPVLkEHGVT...........ITTTF.YKNLFAK.HPEVRPLFDM.GRQ..ESLE

       xxxxx.xxxxxxxxxxxxx..xxxxxxxxxxxxxxx..xxxxxxx.xxxxxxx...xxxxxxxxxxxxxxxx
HAHU   QVKGH.GKKVADA.LTN......AVA.HVDDMPNA...LSALS.D.LHAHKL....RVDPVNF.KLLSHCLL
HAOR   QIKAH.GKKVADA.L.S......TAAGHFDDMDSA...LSALS.D.LHAHKL....RVDPVNF.KLLAHCIL
HADK   QIKAH.GKKVAAA.LVE......AVN.HVDDIAGA...LSKLS.D.LHAQKL....RVDPVNF.KFLGHCFL
HBHU   AVMGNpKVKAHGK.KVLGA..FSDGLAHLDNLKGT...FATLS.E.LHCDKL....HVDPENF.RL.LGNVL
HBOR   AVMGNpKVKAHGA.KVLTS..FGDALKNLDDLKGT...FAKLS.E.LHCDKL....HVDPENFNRL..GNVL
HBDK   AILGNpMVRAHGK.KVLTS..FGDAVKNLDNIKNT...FAQLS.E.LHCDKL....HVDPENF.RL.LGDIL
MYHU   EMKASeDLKKHGA.TVL......TALGGILKKKGHH..EAEIKPL.AQSHATK...HKIPVKYLEFISECII
MYOR   EMKASaDLKKHGG.TVL......TALGNILKKKGQH..EAELKPL.AQSHATK...HKISIKFLEYISEAII
IGLOB  T.GA...FATHATRIVSFLseVIALSGNTSNAAAV...NSLVSKL.GDDHKA....R.GVSAA.QF..GEFR
GPUGNI NNPK...LKAHAAVIFKTI...CESATELRQKGHAVwdNNTLKRL.GSIHLK....N.KITDP.HF.EVMKG
GPYL   NNPD...LQAHAG.KVFKL..TYEAAIQLEVNGAVAs.DATLKSL.GSVHVS....K.GVVDA.HF.PVVKE
GGZLB  Q......PKALAM.TVL......AAAQNIENLPAIL..PAVKKIAvKHCQAGVaaaH.YPIVGQEL.LGAIK

       xxxxxxxxx.xxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxx..x
HAHU   VT.LAA.H..LPAEFTPA..VHASLDKFLASV.STVLTS..KY..R
HAOR   VV.LAR.H..CPGEFTPS..AHAAMDKFLSKV.ATVLTS..KY..R
HADK   VV.VAI.H..HPAALTPE..VHASLDKFMCAV.GAVLTA..KY..R
HBHU   VCVLAH.H..FGKEFTPP..VQAAYQKVVAGV.ANALAH..KY..H
HBOR   IVVLAR.H..FSKDFSPE..VQAAWQKLVSGV.AHALGH..KY..H
HBDK   IIVLAA.H..FTKDFTPE..CQAAWQKLVRVV.AHALAR..KY..H
MYHU   QV.LQSKHPgDFGADAQGA.MNKALELFRKDM.ASNYKELGFQ..G
MYOR   HV.LQSKHSaDFGADAQAA.MGKALELFRNDM.AAKYKEFGFQ..G
IGLOB  TA.LVA.Y..LQANVSWGDnVAAAWNKA.LDN.TFAIVV..PR..L
GPUGNI ALLGTIKEA.IKENWSDE..MGQAWTEAYNQLVATIKAE..MK..E
GPYL   AILKTIKEV.VGDKWSEE..LNTAWTIAYDELAIIIKKE..MKdaA
GGZLB  EVLGDAAT..DDILDAWGK.AYGVIADVFIQVEADLYAQ..AV..E

A Concrete Example: Transcription Factor Binding Sites
Transcription factor proteins bind to DNA and regulate gene expression.
The promoter is a region near the start of the gene where transcription factors bind.

We are given a set of promoters from co-regulated genes.

HIS7  5'- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT
ARO4  5'- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG
ILV6  5'- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT
THR4  5'- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC
ARO1  5'- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA
HOM2  5'- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA
PRO3  5'- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA

An unknown transcription factor binds to positions unknown to us, on either DNA strand.
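Because the binding site may lie on either strand, any search over a promoter has to consider each window both as written and as its reverse complement. The following is a minimal sketch of that bookkeeping, not part of the lecture: the names `reverse_complement` and `candidate_sites` are hypothetical, and the motif width is an assumed input, since the slides treat the width as unknown.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def candidate_sites(seq, width):
    """Yield (start, strand, site) for every window of `width` on both strands.

    `start` is the 0-based offset on the given strand; for strand "-" the
    site is the reverse complement of the window starting at that offset.
    """
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        yield start, "+", window
        yield start, "-", reverse_complement(window)

# Example with the first promoter above and an assumed width of 6.
his7 = "TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT"
sites = list(candidate_sites(his7, 6))
print(len(sites))
```

A promoter of length L therefore contributes 2 x (L - width + 1) candidate sites, which is part of why the search space grows quickly.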
