Post on 21-Jan-2016
Motif Search
What are Motifs
Motif (dictionary): a recurrent thematic element, a common theme
Find a common motif in the text
Find a short common motif in the text
Motifs in biological sequences
Sequence motifs represent a short common sequence (length 4-20) which is highly represented in the data
Motifs in biological sequences
Regulatory motifs on DNA or RNA
Functional sites in proteins
What can we learn from these motifs?
Regulatory Motifs on DNA
Transcription factors (TFs) are regulatory proteins that bind to regulatory motifs near the gene and act as a switch button (on/off)
TF binding motifs are usually 6-20 nucleotides long and are located near the target gene, mostly upstream of the transcription start site
[Figure: TF1 and TF2 bound to the TF1 motif and TF2 motif upstream of the transcription start site of Gene X]
What can we learn from these motifs?
About half of all cancer patients have a mutation in a gene called p53, which codes for a key transcription factor.
The mutations are in the DNA-binding region and allow tumors to survive and continue growing even after chemotherapy severely damages their DNA
[Figure: p53 transcription factor, binding sites (motifs), target gene]
Why is P53 involved in so many cancer types?
We are interested in identifying the genes regulated by p53
p53 regulates over 100 different genes (hub)
Can we find TF targets using a bioinformatics approach?
Finding TF targets using a bioinformatics approach
Scenario 1: Binding motif is known (easier case)
Scenario 2: Binding motif is unknown (hard case)
Scenario 1 : Binding motif is known
Given a motif find the binding sites in an input sequence
Challenges in biological sequences
Motifs are usually not exact words
How to present non exact motifs?
Consensus string: NTAHAWT
May allow degenerate symbols in the string, e.g., N = A/C/G/T; W = A/T; H = not G; S = C/G; R = A/G; Y = T/C, etc.
Position Specific Scoring Matrix (PSSM): probability of each base at each position
[Figure: PSSM table with rows A, T, G, C and columns for positions 1-6]
Given a consensus:
For each position l in the input sequence, check whether the substring starting at position l matches the motif.
Example: find the consensus motif NTAHAWT in the promoter of a gene
>promoter of gene AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA.
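The consensus search can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the slides: the degenerate symbols are translated to regular-expression character classes using the standard IUPAC nucleotide codes, and the function name `find_consensus` is hypothetical. The sequence is copied from the promoter example; where the header ends and the sequence begins is ambiguous in the source, so the reported positions are illustrative only.

```python
import re

# Standard IUPAC degenerate nucleotide codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def find_consensus(consensus, sequence):
    """Return (0-based start, matched substring) for every match, overlaps included."""
    pattern = "".join(IUPAC[symbol] for symbol in consensus)
    # Wrapping the pattern in a lookahead lets finditer report overlapping matches.
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

# Sequence taken from the promoter example (header boundary is an assumption).
promoter = "AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA"
hits = find_consensus("NTAHAWT", promoter)
for start, site in hits:
    print(start, site)
```

Only the given strand is searched here; searching the reverse complement as well would follow the same pattern.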
Given a PSSM:
Starting from a set of aligned motifs
Seq 1 AAAGCCC
Seq 2 CTATCCA
Seq 3 CTATCCC
Seq 4 CTATCCC
Seq 5 GTATCCC
Seq 6 CTATCCC
Seq 7 CTATCCC
Seq 8 CTATCCC
Seq 9 TTATCTG
Given a PSSM:
Counts of each base in each column
Probability of each base in each column
W: Wk = probability of each base in column k
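As a small sketch of how the counts become probabilities, the snippet below builds W from the nine aligned sites listed in the slides. The function name `build_pssm` and the dictionary-per-column layout are assumptions made for illustration. No pseudocounts are added, so a base never seen in a column gets probability 0.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(aligned_sites):
    """Return one {base: probability} dict per column of the alignment."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    pssm = []
    for k in range(length):
        # Count each base in column k, then divide by the column total.
        counts = Counter(site[k] for site in aligned_sites)
        total = sum(counts[b] for b in BASES)
        pssm.append({b: counts[b] / total for b in BASES})
    return pssm

# The nine aligned sites from the slides.
sites = [
    "AAAGCCC", "CTATCCA", "CTATCCC", "CTATCCC", "GTATCCC",
    "CTATCCC", "CTATCCC", "CTATCCC", "TTATCTG",
]
W = build_pssm(sites)
for k, column in enumerate(W, start=1):
    print(k, {b: round(p, 2) for b, p in column.items()})
```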
Given a PSSM:
Given a sequence S (e.g., 1000 base pairs long)
For each substring s of S, compute Pr(s|W)
If Pr(s|W) > some threshold, call that a binding site
In DNA sequences we need to search both strands
AGTTACACCA
TGGTGTAACT (reverse complement)
Seq1: AAAACGTGCGTAGCAGTTACACCAACTCTA
      TTTTGCACGCATCGTCAATGTGGTTGAGAT
Seq2: ACTTACTACTGGTGTAACTATATATTTTCG
      TGAATGATGACCACATTGATATATAAAAGC
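The scan over both strands might look like the sketch below. The slides do not give a PSSM for the motif AGTTACACCA, so the matrix here is hypothetical: it puts probability 0.85 on the motif base in each column and 0.05 on each other base. The threshold of 0.1 and the function names are likewise assumptions. Pr(s|W) is computed as the product of the column probabilities of the bases in s.

```python
import math

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def site_probability(site, pssm):
    """Pr(s|W): product over columns of the probability of the observed base."""
    return math.prod(column[base] for column, base in zip(pssm, site))

def scan_both_strands(sequence, pssm, threshold):
    """Return (start, strand, site) for every window scoring above the threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Score the window itself (+) and its reverse complement (-).
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            if site_probability(site, pssm) > threshold:
                hits.append((start, strand, site))
    return hits

# Hypothetical PSSM built around the motif from the slides (not from the source).
motif = "AGTTACACCA"
W = [{b: (0.85 if b == m else 0.05) for b in "ACGT"} for m in motif]

seq1 = "AAAACGTGCGTAGCAGTTACACCAACTCTA"
seq2 = "ACTTACTACTGGTGTAACTATATATTTTCG"
hits1 = scan_both_strands(seq1, W, threshold=0.1)
hits2 = scan_both_strands(seq2, W, threshold=0.1)
print(hits1)
print(hits2)
```

Start positions are 0-based and always refer to the given (top) strand, so a "-" hit marks the window whose reverse complement matches the motif.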
Scenario 2 : Binding motif is unknown
Ab initio motif finding
Ab initio motif finding: Expectation Maximization
Local search algorithm
- Start from a random PWM
- Move from one PWM to another so as to improve the score which fits the sequence to the motif
- Keep doing this until no more improvement is obtained: convergence to a local optimum
Expectation Maximization
Let W be a PWM. Let S be the input sequence.
Imagine a process that randomly searches, picks different strings matching W and threads them together to a new PWM
Find W so as to maximize Pr(S|W)
The Expectation-Maximization (EM) algorithm iteratively finds a new motif W that improves Pr(S|W)
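One way to picture the iteration is the simplified sketch below. It is not the algorithm of any particular tool, and it rests on several assumptions the slides do not state: the input is a list of sequences, each containing exactly one motif occurrence; the background is uniform; a small pseudocount keeps probabilities non-zero; and iteration stops when the PWM changes by less than a tolerance. The function name `em_motif` and all parameter names are hypothetical.

```python
import random

BASES = "ACGT"

def em_motif(sequences, width, max_iterations=100, pseudocount=0.1,
             tolerance=1e-6, seed=0):
    """Return a PWM (list of {base: probability}) reached by EM from a random start."""
    rng = random.Random(seed)
    # Start from a random PWM: each column is a random probability vector.
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.1 for _ in BASES]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(BASES, weights)})

    for _ in range(max_iterations):
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq in sequences:
            # E-step: likelihood of the motif starting at each position.
            # With a uniform background the background term is constant and cancels.
            starts = range(len(seq) - width + 1)
            likelihoods = []
            for j in starts:
                p = 1.0
                for k in range(width):
                    p *= pwm[k][seq[j + k]]
                likelihoods.append(p)
            norm = sum(likelihoods)
            # Add each window's bases to the counts, weighted by its posterior.
            for j, like in zip(starts, likelihoods):
                z = like / norm
                for k in range(width):
                    counts[k][seq[j + k]] += z
        # M-step: renormalize the weighted counts into a new PWM.
        new_pwm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
        change = max(abs(new_pwm[k][b] - pwm[k][b])
                     for k in range(width) for b in BASES)
        pwm = new_pwm
        if change < tolerance:
            break  # no more improvement: a local optimum
    return pwm

# Hypothetical toy input; the sequences are made up for illustration.
toy = ["GGCTATCCCA", "TTCTATCCCG", "ACTATCCCTT", "CACTATCCCA"]
pwm = em_motif(toy, width=7)
print("".join(max(col, key=col.get) for col in pwm))
```

Because EM is a local search, different random starts (the `seed` argument here) can end at different PWMs, so in practice several starts are usually tried.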
The final PSSM represents the motif which is most enriched in the data
The PSSM can also be represented as a sequence logo
- A letter's height indicates the information it contains
Presenting a sequence motif as a logo
TTCACG
TACATG
TACAGG
TACAAG
T at position 1: log2(4) = 2
T at position 5: log2(1) = 0
Divide each probability by the background probability 0.25 before taking log2
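The two values above can be reproduced with a short sketch. It assumes the logo example consists of four aligned sites of length 6 (TTCACG, TACATG, TACAGG, TACAAG), which is the split consistent with both scores, and it computes log2 of the base probability divided by the background 0.25. The function name `log_odds_scores` is hypothetical.

```python
import math

BACKGROUND = 0.25

def log_odds_scores(aligned_sites, background=BACKGROUND):
    """Return one {base: log2(p / background)} dict per column."""
    n = len(aligned_sites)
    scores = []
    for k in range(len(aligned_sites[0])):
        column = [site[k] for site in aligned_sites]
        column_scores = {}
        for base in "ACGT":
            p = column.count(base) / n
            # A base never seen in the column gets no score (log2(0) is undefined).
            column_scores[base] = math.log2(p / background) if p > 0 else None
        scores.append(column_scores)
    return scores

sites = ["TTCACG", "TACATG", "TACAGG", "TACAAG"]
scores = log_odds_scores(sites)
print("T at position 1:", scores[0]["T"])
print("T at position 5:", scores[4]["T"])
```

At position 1 every site has T, so p = 1 and the score is log2(1 / 0.25) = 2. At position 5 each base appears once, so p = 0.25 and the score is log2(1) = 0.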
Are common motifs the right thing to search for ?
Solutions:
- Search for motifs which are enriched in one set but not in a random set
- Use experimental information to rank the sequences according to their binding affinity and search for enriched motifs at the top of the list
ChIP-Seq
Sequencing the regions in the genome to which a protein (e.g. a transcription factor) binds.
Finding the p53 binding motif in a set of p53 target sequences which are ranked according to binding affinity
[Figure: ChIP-Seq sequences ranked from best binders to weak binders]
A word search approach to search for enriched motifs in a ranked list
[Figure: occurrences of words such as CTGTGA in the ranked sequences]
The approach uses the minimal hypergeometric (mHG) statistic to find enriched motifs
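The mHG idea can be sketched as follows, assuming its usual definition: for a ranked list where each sequence either contains the word (1) or not (0), compute the hypergeometric tail probability at every cutoff from the top and keep the smallest. The function names and the toy occurrence vector are hypothetical. Note that the minimum over cutoffs is a score rather than a p-value, since many cutoffs are tested.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n items are drawn from N items that contain B hits."""
    upper = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)

def mhg_score(occurrence):
    """Return (minimal tail probability, cutoff) over all prefixes of a ranked 0/1 list."""
    N = len(occurrence)
    B = sum(occurrence)
    best, best_cutoff = 1.0, 0
    hits_so_far = 0
    for n in range(1, N + 1):
        hits_so_far += occurrence[n - 1]
        tail = hypergeom_tail(N, B, n, hits_so_far)
        if tail < best:
            best, best_cutoff = tail, n
    return best, best_cutoff

# Hypothetical ranked list: the word occurs only in the three best binders.
ranked = [1, 1, 1, 0, 0, 0, 0, 0]
score, cutoff = mhg_score(ranked)
print(score, cutoff)
```

For this toy list the best cutoff is after the third sequence, where all three occurrences fall in the top three of eight, giving 1 / C(8,3) = 1/56.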
The enriched motifs are combined to get a PSSM which represents the binding motif
Protein Motifs
Protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile:
P[ED]XK[RW][RK]X[ED]
or as a PWM
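A consensus written this way maps almost directly onto a regular expression. In the sketch below, X is assumed to mean "any amino acid" and the bracketed groups are used as character classes as they stand. The function name `consensus_to_regex` and the protein sequence are hypothetical.

```python
import re

def consensus_to_regex(consensus):
    """Compile a protein consensus, treating X as a wildcard for any residue."""
    # Bracketed alternatives such as [ED] are already valid regex classes.
    # This simple replacement assumes X never appears inside brackets.
    return re.compile(consensus.replace("X", "."))

pattern = consensus_to_regex("P[ED]XK[RW][RK]X[ED]")
protein = "MAPEAKRKGDLL"  # made-up sequence for illustration
match = pattern.search(protein)
if match:
    print(match.start(), match.group())
```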