# Motif discovery

Motif discovery. Prof. William Stafford Noble, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington.

Outline: one-minute response, revision, motifs, Gibbs sampling, expectation maximization, Python.



## One-minute responses

- Are we able to read from the smooth curve that we learned to create today what is a good enough p-value? No, the curve gives you the p-value, but deciding what is good enough requires that you know the costs associated with false positives and false negatives.
- The concepts were clear (for today).
- Keep doing more explanation on the board.
- Can you please explain the last part of converting scores to p-values?
- Continue with revision every day.

## Other questions and comments

- We would prefer to know how we did on the first assessment before you give us the second assessment.

## Converting scores to p-values

Say that your motif has N positions (one column per position in the table below, so N = 4 here). Create a matrix that has N rows and 100N columns. The entry in row i, column j is the number of different sequences of length i that can have a score of j.

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| A | 10 | 67 | 59 | 44 |
| C | 60 | 39 | 49 | 29 |
| G | 0 | 71 | 50 | 54 |
| T | 100 | 43 | 13 | 64 |

The columns of the dynamic programming matrix are indexed by score: 0, 1, 2, 3, 4, ..., 400.

1. For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix. There are only 4 possible sequences of length 1, so the first row has a 1 in columns 0, 10, 60, and 100.
2. For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. Add y to column x+z of the second row. For example, x = 67 and z = 10 put a count in column 77. What values will go in row 2? 10+67, 10+39, 10+71, 10+43, 60+67, ..., 100+43. These 16 values correspond to all 16 strings of length 2.
3. Repeat for each remaining column of the motif. In the end, the bottom row contains the scores for all possible sequences of length N. Use these scores to compute a p-value.

## Sample problem

List the scores of all 4 length-1 DNA sequences relative to this motif:

|   | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| A | 10 | 15 | 100 | 80 |
| C | 100 | 30 | 7 | 21 |
| G | 0 | 30 | 22 | 14 |
| T | 10 | 35 | 9 | 51 |

A=10, C=100, G=0, T=10

List the scores of all 16 length-2 DNA sequences relative to the same motif.

AA=25, AC=40, AG=40, AT=45, CA=115, CC=130, CG=130, CT=135, GA=15, GC=30, GG=30, GT=35, TA=25, TC=40, TG=40, TT=45

Sorted by score: GA=15, AA=25, TA=25, GC=30, GG=30, GT=35, AC=40, AG=40, TC=40, TG=40, AT=45, TT=45, CA=115, CC=130, CG=130, CT=135

Draw the dynamic programming matrix for this motif and indicate where the 20 scores you computed would go. The matrix columns run 0, 5, 10, ..., 135; only the columns with nonzero counts are shown here.

| Score | 0 | 10 | 15 | 25 | 30 | 35 | 40 | 45 | 100 | 115 | 130 | 135 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Row 1 | 1 | 2 |  |  |  |  |  |  | 1 |  |  |  |
| Row 2 |  |  | 1 | 2 | 2 | 1 | 4 | 2 |  | 1 | 2 | 1 |

How many distinct scores do you observe for sequences of length 2? 9

How many calculations will you need to perform to compute scores for sequences of length 3? 9 x 4 = 36

## Revision

How many distinct amino acid sequences of length 10 exist? $20^{10} = 1.024 \times 10^{13}$

Say that you use dynamic programming to compute the distribution of scores for a motif of width 10. You then observe a sequence with a score of 28. Describe how you would compute the p-value of 28 from the output of the dynamic programming.

Compute the sum S of the counts for scores ≥ 28. The p-value is $S / 20^{10}$.
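The dynamic programming procedure described above can be sketched in Python. This is a minimal illustration rather than the lecture's own code: the function names `score_distributions` and `p_value` are made up for this example, and each row is stored as a sparse `Counter` (score to count) instead of a dense N by 100N matrix, which is an assumption made to keep the sketch short. The motif is the one from the sample problem.

```python
from collections import Counter

# Sample-problem motif: one list per letter, one entry per motif position.
MOTIF = {
    "A": [10, 15, 100, 80],
    "C": [100, 30, 7, 21],
    "G": [0, 30, 22, 14],
    "T": [10, 35, 9, 51],
}


def score_distributions(motif):
    """Return one Counter per row of the DP matrix.

    rows[i][j] is the number of sequences of length i + 1 with score j.
    (Hypothetical helper; sparse rows are an assumption of this sketch.)
    """
    width = len(next(iter(motif.values())))
    rows = []
    previous = Counter({0: 1})  # the empty sequence has score 0
    for position in range(width):
        current = Counter()
        for letter_scores in motif.values():
            x = letter_scores[position]
            # Every sequence counted at score z extends to score x + z.
            for z, y in previous.items():
                current[x + z] += y
        rows.append(current)
        previous = current
    return rows


def p_value(final_row, observed, alphabet_size, width):
    """Fraction of all sequences scoring at least `observed`."""
    s = sum(count for score, count in final_row.items() if score >= observed)
    return s / alphabet_size ** width


rows = score_distributions(MOTIF)
print(sorted(rows[0].items()))  # counts for length-1 sequences
print(sorted(rows[1].items()))  # counts for length-2 sequences
print(p_value(rows[-1], 315, alphabet_size=4, width=4))
```

The second row printed here can be compared against a hand-drawn matrix for the sample problem. For a protein motif, the same sketch would take 20 lists of scores and `alphabet_size=20`.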

## Motif discovery problem

Given sequences (seq. 1, seq. 2, seq. 3), find the motif:

- IGRGGFGEVY at position 515
- LGEGCFGQVV at position 430
- VGSGGFGQVY at position 682

In general:

Given: a sequence or family of sequences.

Find:

- the number of motifs
- the width of each motif
- the locations of motif occurrences
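To make the output of motif discovery concrete, the three occurrences from the example can be aligned and tallied column by column. This is only a sketch: `column_counts` is a hypothetical helper name, and the raw counts it produces are not the motif scores used earlier, since how counts are converted into scores is not specified here.

```python
from collections import Counter

# The three motif occurrences from the example (width 10).
OCCURRENCES = ["IGRGGFGEVY", "LGEGCFGQVV", "VGSGGFGQVY"]


def column_counts(occurrences):
    """Count the letters seen at each motif position (hypothetical helper)."""
    widths = {len(occurrence) for occurrence in occurrences}
    if len(widths) != 1:
        raise ValueError("all occurrences must have the same width")
    # zip(*...) walks the aligned occurrences one column at a time.
    return [Counter(column) for column in zip(*occurrences)]


counts = column_counts(OCCURRENCES)
for position, column in enumerate(counts, start=1):
    print(position, dict(column))
```

Positions where one letter accounts for every count, such as the glycines at positions 2, 4, and 7, are the conserved columns of this motif.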