motif search

Post on 21-Jan-2016

TRANSCRIPT

  • Motif Search

  • What are Motifs

    Motif (dictionary) A recurrent thematic element, a common theme

  • Find a common motif in the text

  • Find a short common motif in the text

  • Motifs in biological sequences

    Sequence motifs are short common sequences (length 4-20) that are highly represented in the data

  • Motifs in biological sequences

    Regulatory motifs on DNA or RNA

    Functional sites in proteins

    What can we learn from these motifs?

  • Regulatory Motifs on DNA

    Transcription Factors (TFs) are regulatory proteins that bind to regulatory motifs near the gene and act as a switch button (on/off)

    TF binding motifs are usually 6-20 nucleotides long and located near the target gene, mostly upstream of the transcription start site

    [Figure labels: Transcription Start Site, TF1 motif, TF2 motif, Gene X, TF1, TF2]

  • What can we learn from these motifs?

    About half of all cancer patients have a mutation in a gene called p53, which codes for a key transcription factor. The mutations are in the DNA binding region and allow tumors to survive and continue growing even after chemotherapy severely damages their DNA

    [Figure labels: p53 transcription factor, target gene, binding sites (motifs)]

  • Why is p53 involved in so many cancer types?

    We are interested in identifying the genes regulated by p53

    p53 regulates over 100 different genes (hub)

  • Can we find TF targets using a bioinformatics approach?

  • Finding TF targets using a bioinformatics approach

    Scenario 1 : Binding motif is known (easier case)

    Scenario 2 : Binding motif is unknown (hard case)

  • Scenario 1 : Binding motif is known

    Given a motif find the binding sites in an input sequence

  • Challenges in biological sequences

    Motifs are usually not exact words

  • How to present non exact motifs?

  • How to present non exact motifs?

    Consensus string: NTAHAWT

    May allow degenerate symbols in the string, e.g., N = A/C/G/T; W = A/T; H = not G; S = C/G; R = A/G; Y = T/C, etc.

    Position Specific Scoring Matrix (PSSM)

    Probability for each base in each position

    [Figure: matrix with rows A, T, G, C and columns 1-6]

  • Given a consensus:

    For each position l in the input sequence, check if the substring starting at position l matches the motif.

    Example: find the consensus motif NTAHAWT in the promoter of a gene

    >promoter of gene AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA.
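
The position-by-position check described above can be sketched with a regular expression. This is a minimal illustration, not the method used in the slides: the function names are hypothetical, the degenerate codes beyond those listed on the slide are the standard IUPAC ones, and the promoter string is assumed to be the whole run of letters that follows "gene " in the example. Only the given strand is searched.

```python
import re

# IUPAC degenerate nucleotide codes. The slide lists N, W, H, S, R and Y;
# the remaining entries are the standard IUPAC definitions.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Turn a consensus string such as NTAHAWT into a regular expression."""
    return "".join("[" + IUPAC[symbol] + "]" for symbol in consensus)


def find_consensus(sequence, consensus):
    """Return (start, site) for every 0-based position l where the substring
    starting at l matches the consensus. Overlapping matches are reported."""
    # A lookahead makes each match zero-width, so the scan advances one
    # position at a time and overlapping sites are not skipped.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


# Assumption: the sequence is everything after "gene " in the example line.
promoter = "AACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA"
for start, site in find_consensus(promoter, "NTAHAWT"):
    print(start, site)
```

On this string the scan reports two sites, GTATATT and ATAAATT.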

  • Given a PSSM:

    Starting from a set of aligned motifs

    Seq 1 AAAGCCC
    Seq 2 CTATCCA
    Seq 3 CTATCCC
    Seq 4 CTATCCC
    Seq 5 GTATCCC
    Seq 6 CTATCCC
    Seq 7 CTATCCC
    Seq 8 CTATCCC
    Seq 9 TTATCTG

  • Given a PSSM: W

    Wk = probability of base in column k

    Counts of each base in each column

    A   1  1  9  9  0  0  0  1
    C   6  0  0  0  0  9  8  7
    G   1  0  0  0  1  0  0  1
    T   1  8  0  0  8  0  1  0

    Probability of each base in each column

    A   .11  .11  1    1    0    0    0    .11
    C   .67  0    0    0    0    1    .89  .78
    G   .11  0    0    0    .11  0    0    .11
    T   .11  .89  0    0    .89  0    .11  0
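
The step from counts to probabilities is a per-column normalization, which the following sketch makes explicit. The function names are hypothetical; the count table is copied from the slide (rows A, C, G, T, nine sites per column), and `count_matrix` shows how such a table could be built from any set of equal-length aligned sites.

```python
from collections import Counter

BASES = "ACGT"


def count_matrix(aligned_sites):
    """Count each base in each column of equal-length aligned sites."""
    width = len(aligned_sites[0])
    if any(len(site) != width for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    columns = [Counter(site[k] for site in aligned_sites) for k in range(width)]
    return {base: [col[base] for col in columns] for base in BASES}


def probability_matrix(counts):
    """Divide every count by its column total to obtain W."""
    width = len(counts["A"])
    totals = [sum(counts[b][k] for b in BASES) for k in range(width)]
    return {b: [counts[b][k] / totals[k] for k in range(width)] for b in BASES}


# Count table from the slide.
counts = {
    "A": [1, 1, 9, 9, 0, 0, 0, 1],
    "C": [6, 0, 0, 0, 0, 9, 8, 7],
    "G": [1, 0, 0, 0, 1, 0, 0, 1],
    "T": [1, 8, 0, 0, 8, 0, 1, 0],
}
W = probability_matrix(counts)
for base in BASES:
    print(base, [round(p, 2) for p in W[base]])
```

Rounded to two decimals, the result reproduces the probability table (for example 6/9 = .67 and 8/9 = .89). In practice a small pseudocount is often added to each count so that no probability is exactly zero; that is a common convention, not something stated in the slides.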

  • Given a PSSM:

    Given sequence S (e.g., 1000 base-pairs long)

    For each substring s of S, compute Pr(s|W)

    If Pr(s|W) > some threshold, call that a binding site

    In DNA sequences we need to search both strands

    AGTTACACCA
    TGGTGTAACT (reverse complement)

    Seq1 : AAAACGTGCGTAGCAGTTACACCAACTCTA
           TTTTGCACGCATCGTCAATGTGGTTGAGAT

    Seq2 : ACTTACTACTGGTGTAACTATATATTTTCG
           TGAATGATGACCACATTGATATATAAAAGC
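
A minimal sketch of the scan described above, under stated assumptions: Pr(s|W) is taken as the product of the per-column probabilities, the PWM is a hypothetical toy matrix built from the consensus AGTTACACCA (0.97 for the consensus base, 0.01 for each other base), and the threshold 0.5 is arbitrary. All function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def pwm_from_consensus(consensus, match=0.97):
    """Hypothetical toy PWM: the consensus base gets `match`,
    the other three bases share the remainder equally."""
    other = (1.0 - match) / 3
    return [{b: (match if b == c else other) for b in "ACGT"} for c in consensus]


def site_probability(site, pwm):
    """Pr(s|W): product of the column probabilities of the observed bases."""
    p = 1.0
    for column, base in zip(pwm, site):
        p *= column[base]
    return p


def scan_both_strands(seq, pwm, threshold):
    """Return (start on the given strand, strand, Pr(s|W)) for each hit."""
    width, n = len(pwm), len(seq)
    rc = reverse_complement(seq)
    hits = []
    for i in range(n - width + 1):
        p = site_probability(seq[i:i + width], pwm)
        if p > threshold:
            hits.append((i, "+", p))
        p = site_probability(rc[i:i + width], pwm)
        if p > threshold:
            # rc[i:i+width] is the reverse complement of seq[n-i-width:n-i]
            hits.append((n - i - width, "-", p))
    return hits


seq1 = "AAAACGTGCGTAGCAGTTACACCAACTCTA"
seq2 = "ACTTACTACTGGTGTAACTATATATTTTCG"
W = pwm_from_consensus("AGTTACACCA")
print(scan_both_strands(seq1, W, threshold=0.5))
print(scan_both_strands(seq2, W, threshold=0.5))
```

With these toy settings Seq1 yields one hit on the "+" strand and Seq2 one hit on the "-" strand, matching the two example sequences. For long motifs the product underflows easily, so real scanners usually sum log probabilities instead.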

  • Scenario 2 : Binding motif is unknown

    Ab initio motif finding

  • Ab initio motif finding: Expectation Maximization

    Local search algorithm

    - Start from a random PWM

    - Move from one PWM to another so as to improve the score which fits the sequence to the motif

    - Keep doing this until no more improvement is obtained: convergence to a local optimum

  • Expectation Maximization

    Let W be a PWM. Let S be the input sequence.

    Imagine a process that randomly searches, picks different strings matching W and threads them together to a new PWM

  • Expectation Maximization

    Find W so as to maximize Pr(S|W)

    The Expectation-Maximization (EM) algorithm iteratively finds a new motif W that improves Pr(S|W)
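
The loop described above can be sketched as follows. This is a simplified illustration, not the exact procedure from the slides: it assumes the input is a list of sequences with exactly one motif occurrence each, a uniform background, and a small pseudocount so that no probability becomes zero. All names and default values are hypothetical.

```python
import random

BASES = "ACGT"


def window_prob(window, pwm):
    """Probability of one window under the PWM (product over columns)."""
    p = 1.0
    for column, base in zip(pwm, window):
        p *= column[base]
    return p


def random_pwm(width, rng):
    """Random starting PWM; every entry is strictly positive."""
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.1 for _ in BASES]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(BASES, weights)})
    return pwm


def em_step(seqs, pwm, pseudocount):
    """One EM iteration: weight every window, then re-estimate the PWM."""
    width = len(pwm)
    counts = [{b: pseudocount for b in BASES} for _ in range(width)]
    for seq in seqs:
        starts = range(len(seq) - width + 1)
        probs = [window_prob(seq[j:j + width], pwm) for j in starts]
        total = sum(probs)
        for j, p in zip(starts, probs):
            z = p / total  # E-step: posterior that the motif starts at j
            for k in range(width):
                counts[k][seq[j + k]] += z
    # M-step: normalize the expected counts column by column
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]


def em_motif(seqs, width, max_iter=200, tol=1e-6, pseudocount=0.1, seed=0):
    """Iterate from a random PWM until the PWM stops changing."""
    pwm = random_pwm(width, random.Random(seed))
    for _ in range(max_iter):
        new_pwm = em_step(seqs, pwm, pseudocount)
        change = max(abs(new_pwm[k][b] - pwm[k][b])
                     for k in range(width) for b in BASES)
        pwm = new_pwm
        if change < tol:
            break
    return pwm


def consensus(pwm):
    """Most probable base in each column."""
    return "".join(max(column, key=column.get) for column in pwm)


# Hypothetical toy input, for illustration only.
toy = ["CCTTGACAGG", "ATTGACATCC", "GGCTTGACAA", "TTGACACGCG"]
print(consensus(em_motif(toy, width=6)))
```

Because the search is local, the result depends on the random start. Running it with several values of `seed` and keeping the best-scoring PWM is a common way to reduce that dependence.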

  • Expectation Maximization

  • The final PSSM represents the motif which is most enriched in the data

    The PSSM can also be represented as a sequence logo

    - A letter's height indicates the information it contains

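  • Presenting a sequence motif as a logo

    TTCACG
    TACATG
    TACAGG
    TACAAG

    Counts of each base in each position

        1     2     3     4     5     6
    A   0     3     0     4     1     0
    G   0     0     0     0     1     4
    C   0     0     4     0     1     0
    T   4     1     0     0     1     0

    PWM (probability of each base in each position)

        1     2     3     4     5     6
    A   0     0.75  0     1     0.25  0
    G   0     0     0     0     0.25  1
    C   0     0     1     0     0.25  0
    T   1     0.25  0     0     0.25  0

    PSSM letter height: divide each PWM probability by the background probability 0.25 and take log2

    T position 1 = log2(1 / 0.25) = log2 4 = 2

    T position 5 = log2(0.25 / 0.25) = log2 1 = 0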
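
The letter-height rule from this slide can be checked with a few lines of code. The sketch below follows the slide's rule, log2 of the probability divided by the background 0.25, and makes no claim about how any particular logo tool scales its letters. The function name is hypothetical, and bases with probability 0 are simply left out because their log is undefined.

```python
import math


def letter_heights(sites, background=0.25):
    """For each position, map each observed base to log2(p / background)."""
    n = len(sites)
    width = len(sites[0])
    heights = []
    for k in range(width):
        column = [site[k] for site in sites]
        col_heights = {}
        for base in "ACGT":
            p = column.count(base) / n
            if p > 0:  # log2(0) is undefined, so absent bases are skipped
                col_heights[base] = math.log2(p / background)
        heights.append(col_heights)
    return heights


sites = ["TTCACG", "TACATG", "TACAGG", "TACAAG"]
heights = letter_heights(sites)
print(heights[0]["T"])  # T at position 1: log2(1 / 0.25) = 2.0
print(heights[4]["T"])  # T at position 5: log2(0.25 / 0.25) = 0.0
```

A base that occurs at exactly the background frequency gets height 0, which is why position 5, where all four bases appear once, carries no information.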

  • Are common motifs the right thing to search for?

  • Solutions:

    - Search for motifs which are enriched in one set but not in a random set

    - Use experimental information to rank the sequences according to their binding affinity and search for enriched motifs at the top of the list

  • ChIP-Seq

    Sequencing the regions in the genome to which a protein (e.g. a transcription factor) binds

  • Finding the p53 binding motif in a set of p53 target sequences which are ranked according to binding affinity

    [Figure: ChIP-Seq ranked list, from best binders to weak binders]

  • A word search approach to search for enriched motifs in a ranked list

    [Figure labels: CTGTGC, CTGTGA]

  • Uses the minimal hypergeometric (mHG) statistic to find enriched motifs

    [Figure labels: CTGTGA]
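
To make the mHG idea concrete, here is a minimal sketch of the score for one word. It assumes the sequences are already ranked with the best binders first and that a word either occurs in a sequence or does not; the function names are hypothetical. For every cutoff n it computes the hypergeometric tail probability of seeing at least the observed number of occurrences among the top n sequences, and keeps the smallest value.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when n of N items are drawn and B of the N are marked."""
    upper = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)


def occurrence_vector(ranked_sequences, word):
    """1 if the word occurs in the sequence, else 0, in ranking order."""
    return [int(word in seq) for seq in ranked_sequences]


def mhg_score(occurrences):
    """Return (minimal tail probability, cutoff n that achieves it)."""
    N, B = len(occurrences), sum(occurrences)
    best, best_n, b = 1.0, 0, 0
    for n, hit in enumerate(occurrences, start=1):
        b += hit  # occurrences among the top n sequences
        p = hypergeom_tail(N, B, n, b)
        if p < best:
            best, best_n = p, n
    return best, best_n


# Hypothetical ranked list: the word occurs only in the three top sequences.
print(mhg_score([1, 1, 1, 0, 0, 0, 0, 0]))
```

In the toy list the minimum is reached at cutoff 3, where the tail probability is 1/56. Because the minimum is taken over many cutoffs, the score is not itself a p-value and needs a correction before it is reported as one.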

  • The enriched motifs are combined to get a PSSM which represents the binding motif

  • Protein Motifs

    Protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile:

    P[ED]XK[RW][RK]X[ED]

    or as a PWM
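
A consensus of this form maps almost directly onto a regular expression. The sketch below assumes X stands for any of the 20 standard amino acids and never appears inside square brackets; the function names and the test peptide are hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def protein_consensus_to_regex(consensus):
    """Replace the wildcard X by a class of all 20 standard amino acids.
    Bracketed alternatives such as [ED] are already valid regex classes."""
    return consensus.replace("X", "[" + AMINO_ACIDS + "]")


def find_protein_motif(protein, consensus):
    """Return (start, site) for every 0-based match, overlaps included."""
    pattern = re.compile("(?=(" + protein_consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(protein)]


peptide = "MAPEAKRKGDL"  # hypothetical peptide, for illustration only
print(find_protein_motif(peptide, "P[ED]XK[RW][RK]X[ED]"))
# prints [(2, 'PEAKRKGD')]
```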
