Sequence motifs

Post on 19-Dec-2015


Page 1: Sequence motifs. What are sequence motifs? Sequences are translated into electron densities with different affinities of interacting with other molecules

Sequence motifs

What are sequence motifs?

• Sequences are translated into electron densities with different affinities of interacting with other molecules.

• Motifs represent a short common sequence
– Regulatory motifs (TF binding sites)
– Functional site in proteins (DNA binding motif)

DNA Regulatory Motifs

• Transcription factors bind to regulatory motifs with high affinity
– TF binding motifs are usually 6–20 nucleotides long
– Usually located near the target gene, mostly upstream of the transcription start site

[Figure: Gene X with an SBF motif and an MCM1 motif near its transcription start site; the factors MCM1 and SBF are shown at the motifs]

Identification of Known Motifs within Genomic Sequences

• Main motivation: identifying the targets of regulatory proteins (e.g. transcription factors) in the cell

In many cancers specific TFs are known to be mutated. How do we identify the genes that are affected downstream?

p53, the guardian of the cell

How can we start looking for p53 (or any other transcription factor) targets using bioinformatics?

Scenario 1: Binding motif is known (easier case)

Scenario 2: Binding motif is unknown (hard case)

Challenges

• How to recognize a regulatory motif?
• Can we identify new occurrences of known motifs in genome sequences?
• Can we discover new motifs within upstream sequences of genes?

Scenario 1 : Binding targets are known

1. Motif Representation

• Consensus: represent only ‘common’ nucleotides

• NANCATNNCCTTTTTATACAGNNNTTNNNTNN

• N stands for any nucleotide.

• Representing only consensus loses information. How can this be avoided?

GTTCTTCGTGTTTATTTTTAGGAAATTGATGATTGTTTCTCCTTTTAAAATAGTACTGCTGTTTTTTACTAACGACACATTGAAGAAATCACTTTGGATACGCTTACCGTTATCCAGAGCTACAGCGCTACTAATATGTAATACTTCAGCTCCCCTTAATATTGAGATCTTTTTTAACTAGTTAGGTCTACCTTCTCCCCTTCTTCATTTTAGCCTGTTTGGACTAACATAACTTATTTACATAGTGCCATTGAACGATATTTCCCGTTGTGTTAAGGCTGAGAAGAATTTTCCCGACCATCAAGACAGGTGATTTATCATGCAAAAACTTTTTTTCACAGGGCTAACTTGCGTTTATTGTGTTTCCACTCAGTTAAAAAACGAAACGTACTTTAATATTTATAGTACTTCATTCGAACATGCTATTTTTCATACAGCAACCTCACATCTGCACTCATCATTAGATTAGAGGAACATGGATACTTTTCTTTATCTAAGCAGCTAACTCAACTATCAACATGCTATTGAACTAGAGATCCACCTATAACTAACATGACTTTAACAGGGCTAATTTACAGTACTAACTAATTAACTTAGAACATTAACATGATCACCGTCACATTTATTAGAATTTCAAACGCAGTGGAATTTTTTTTTCTAGAAATGGTATCGCTCTATGACCAATAAAAACAGACTGTACTTTCAAATGGTATTATTTATAACAGTTGAACATTTCATAAATATGCGATCAATATAGACCGTTGATATATTTTACTTTTTTTTTTTTAGGAGCTCCAAGAATTTATTTCCTTATAATACAGACACGGTTACATCGCAATTAATTTTCTAATAGTTTTTCATTTTGACCATCTTTCTTTTCCCCAGTGCTAAACACGAACCTTCTTTCTCATTCGTAGATTACTGTTGCAATTACTAACAGCTGTAATAGCCGACAAATTTCTCTCTGCGCGTCCAATTTAGCTATACTGTTGTTGTTTTGTTTTGTCGTACAGTGTTTGGAGAAAAACTTCCATTTCTTACATAGATCATCGCCATTCCTTTCCATAATTTATTCAGCGCTTTGGTATCGATTTACTATTTCCATTTAGACGTTGTTCAAAATTTACTAACAATACTTCAGTTTATAATGGATCCTATACTAACAATTTGTAGTTCATAAATAA
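A consensus of this kind can be derived from an alignment column by column. The sketch below is only an illustration: the function name `consensus`, the `min_fraction` threshold of 0.6 and the toy sites are assumptions, not taken from the slides. It also shows where the information is lost: every position without a dominant base collapses to N, whatever its actual base frequencies are.

```python
from collections import Counter

def consensus(aligned, min_fraction=0.6):
    """Return a consensus string; positions without a dominant base become 'N'.

    min_fraction is an assumed cut-off, chosen here only for illustration.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    result = []
    for column in zip(*aligned):
        # Most frequent base in this column and how often it occurs.
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(aligned) >= min_fraction else "N")
    return "".join(result)

# Toy alignment: the third column has four different bases.
toy_sites = ["TGTGA", "TGAGA", "TGCGA", "TGGGA"]
print(consensus(toy_sites))
```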

Entropy - Definition

Claude E. Shannon 1948, “A mathematical theory of communication”.

Page 12: Sequence motifs. What are sequence motifs? Sequences are translated into electron densities with different affinities of interacting with other molecules

Entropy - Example
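The formula and the example on these slides are not preserved in the transcript. The standard Shannon entropy of a discrete distribution is H = -Σ p_i log2(p_i), measured in bits. The sketch below computes it for a few simple cases; the function name `entropy` and the example distributions are chosen here for illustration only.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; terms with p == 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))                # fair coin
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over A, C, G, T
print(entropy([1.0, 0.0, 0.0, 0.0]))      # a fully conserved position
```

For DNA, the entropy of a position ranges from 0 bits (one base only) to 2 bits (all four bases equally likely).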

Relative Entropy

The Kullback-Leibler distance D
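The slide's formula is not preserved here. The standard definition is D(p || q) = Σ p_i log2(p_i / q_i). The sketch below is a minimal implementation under that definition; the function name and the example distributions are illustrative assumptions.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler distance D(p || q) in bits.

    Assumes q_i > 0 wherever p_i > 0; otherwise the division fails.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
two_bases = [0.5, 0.5, 0.0, 0.0]
print(relative_entropy(two_bases, uniform))
print(relative_entropy(uniform, uniform))
```

D is not symmetric: in general D(p || q) differs from D(q || p), so it is a "distance" only in a loose sense.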

Information content

Logo plots - HowTo

Count nucleotides at each position:

A  94  88  84  75  78  78  71  69  70  60  68  77  32  49  87  93  93 134   9 266   0  86  66  85  81  89  81  88  82
C  31  45  52  44  56  46  62  54  56  51  46  37  30  42  32  44  30  25 122   1   0  38  65  52  43  62  62  57  43
T 113 110 113 117 104 117 111 120 118 125 136 140 182 155 122 100 124  75 137   0   0  72  85  82  91  83  73  67  96
G  30  25  19  32  30  27  24  25  24  32  18  14  24  22  27  31  21  34   0   1 268  72  52  49  53  34  52  56  47

Convert to frequencies:

A 0.35 0.33 0.31 0.28 0.29 0.29 0.26 0.26 0.26 0.22 0.25 0.29 0.12 0.18 0.32 0.35 0.35 0.50 0.03 0.99 0.00 0.32 0.25 0.32 0.30 0.33 0.30 0.33 0.31
C 0.12 0.17 0.19 0.16 0.21 0.17 0.23 0.20 0.21 0.19 0.17 0.14 0.11 0.16 0.12 0.16 0.11 0.09 0.46 0.00 0.00 0.14 0.24 0.19 0.16 0.23 0.23 0.21 0.16
T 0.42 0.41 0.42 0.44 0.39 0.44 0.41 0.45 0.44 0.47 0.51 0.52 0.68 0.58 0.46 0.37 0.46 0.28 0.51 0.00 0.00 0.27 0.32 0.31 0.34 0.31 0.27 0.25 0.36
G 0.11 0.09 0.07 0.12 0.11 0.10 0.09 0.09 0.09 0.12 0.07 0.05 0.09 0.08 0.10 0.12 0.08 0.13 0.00 0.00 1.00 0.27 0.19 0.18 0.20 0.13 0.19 0.21 0.18

Frequency-logo:
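The count-to-frequency step can be sketched as follows. The numbers are four adjacent columns copied from the count table (the ones around the conserved A and G); the function name `to_frequencies` and the dictionary layout are assumptions made for this example.

```python
# Four adjacent columns from the count table, rows in the slide's A, C, T, G order.
counts = {
    "A": [134, 9, 266, 0],
    "C": [25, 122, 1, 0],
    "T": [75, 137, 0, 0],
    "G": [34, 0, 1, 268],
}

def to_frequencies(counts):
    """Divide every count by the total of its column."""
    n_columns = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(n_columns)]
    return {base: [c / t for c, t in zip(row, totals)]
            for base, row in counts.items()}

frequencies = to_frequencies(counts)
for base, row in frequencies.items():
    print(base, " ".join(f"{value:.2f}" for value in row))
```

Each column sums to 268, the number of aligned sequences, so the frequencies in a column sum to 1.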

• Multiple alignment of acceptor sites from 268 yeast DNA sequences

– What is the biological signal around the site?
– What are the important positions?
– How can it be visualized?

Biological information

• Logo plot with Information Content

[Figure: sequence logo of the aligned acceptor sites, with the regions labelled Exon, Intron, Exon]

Logo plots - Information Content

Sequence-logo

Calculate Information Content:

$I = \sum_a p_a \log_2 p_a + \log_2(4)$, maximal value is 2 bits

• X axis – relative position. Y axis – information content in bits (the relative entropy with respect to a uniform background).
• Total height at a position is the Information Content measured in bits.
• Height of each letter is proportional to the frequency of that letter.
• Stack order indicates importance; the consensus is read at the top.
• A Logo plot is a visualization of a multiple alignment.

[Figure: sequence logo with two annotated positions, one labelled "~0.5 each" and one labelled "Completely conserved"]
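The information content formula and the letter heights of a logo can be checked with a few lines of code. This is a sketch under the slide's formula I = Σ p_a log2(p_a) + log2(4); the function names and the three example columns are assumptions made for illustration.

```python
import math

def information_content(column):
    """I = sum_a p_a * log2(p_a) + log2(4) for one column of base frequencies."""
    return math.log2(4) + sum(p * math.log2(p) for p in column.values() if p > 0)

def letter_heights(column):
    """Height of each letter in the logo: its frequency times the column's I."""
    ic = information_content(column)
    return {base: p * ic for base, p in column.items()}

conserved = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
two_way = {"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for name, column in [("conserved", conserved), ("two-way", two_way), ("uniform", uniform)]:
    print(name, information_content(column), letter_heights(column))
```

A completely conserved column reaches the maximum of 2 bits, a column split evenly between two bases carries 1 bit (so each of the two letters is 0.5 bits tall), and a uniform column carries no information.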

Pseudocounts
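The content of this slide is not preserved in the transcript. A common approach, assumed here, is to add a small constant to every count before normalizing, so that a base never observed in a column does not get probability 0 (and, later, a log-odds score of minus infinity). The function name and the add-one value are illustrative choices.

```python
def frequencies_with_pseudocounts(column_counts, pseudocount=1.0):
    """Add `pseudocount` to every base count, then normalize the column.

    pseudocount=1.0 (add-one smoothing) is an assumption, not the slide's value.
    """
    total = sum(column_counts.values()) + pseudocount * len(column_counts)
    return {base: (count + pseudocount) / total
            for base, count in column_counts.items()}

# The fully conserved G column from the count table (268 sequences).
column = {"A": 0, "C": 0, "T": 0, "G": 268}
smoothed = frequencies_with_pseudocounts(column)
print(smoothed)
```

With 268 observations the correction is small; with only a handful of known sites it changes the profile noticeably.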

PSSM – Position Specific Scoring Matrix

• Besides Entropy and Information content there are other ways to express a motif

     -4     -3     -2     -1     0
A    0.18   0.2    -      1      -
C    0.05   0.02   0.5    -      -
T    0.2    0.18   0.5    -      -
G    0.02   0.05   -      -      1
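One widely used form of a PSSM stores log-odds scores, log2(p / background), and scores a candidate site by summing one entry per position. The sketch below assumes that form, a uniform background and strictly positive frequencies (for example after pseudocounts); it does not reproduce the values in the table above, and all names and the toy motif are hypothetical.

```python
import math

# Assumed uniform background distribution.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_matrix(freq_columns, background=BACKGROUND):
    """Turn per-position frequencies (all > 0) into log2(p / background) scores."""
    return [{base: math.log2(column[base] / background[base]) for base in column}
            for column in freq_columns]

def score(pssm, window):
    """Sum the matrix entries selected by the bases of `window`."""
    return sum(column[base] for column, base in zip(pssm, window))

def scan(pssm, sequence, threshold):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score(pssm, sequence[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits

# Toy three-position motif resembling (C or T), A, G.
toy_frequencies = [
    {"A": 0.10, "C": 0.40, "G": 0.10, "T": 0.40},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
]
pssm = log_odds_matrix(toy_frequencies)
print(scan(pssm, "TTCAGGA", threshold=2.0))
```

The threshold is a free parameter: a low value reports many chance matches for short motifs, a high value misses weaker real sites.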

Example:

Predicting the cAMP Receptor Protein (CRP) binding site motif by using a logo plot

Extract experimentally defined CRP Binding Sites

GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA

Create a Multiple Sequence Alignment

GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA

Generate a Logo plot

XXXXXTGTGAXXXXAXTCACAXXXXXXXXXXXXACACTXXXXTXAGTGTXXXXXXX

WebLogo - Input

• http://weblogo.berkeley.edu

WebLogo - Outputs

Genes:

Proteins:

PROBLEMS…

• When searching for a motif in a genome using a PSSM or other methods, the motif is usually found all over the place
– The motif is considered real if found in the vicinity of a gene.

• Checking experimentally for the binding sites of a specific TF (location analysis): the sites bound by the TF are in some cases similar to the PSSM and sometimes not!

Scenario 2 : Binding targets are unknown

Finding new Motifs

• We are given a group of genes, which presumably contain a common regulatory motif.

• We know nothing of the TF that binds to the putative motif.

• The problem: discover the motif.

Motif Discovery

Computational Methods

• This problem has received a lot of attention from CS people.
• Methods include:

– Probabilistic methods – hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.

– Enumeration methods – problematic for inexact motifs of length k>10. …

• Current status: Problem is still open.
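The enumeration idea can be sketched for the simplest case, exact k-mers: list every k-mer and count how many input sequences contain it. The function names and the toy sequences are assumptions for this example. Allowing mismatches means comparing against all 4^k candidate motifs (4^10 is already 1,048,576), which is why the approach becomes problematic for inexact motifs of length k>10.

```python
from collections import defaultdict

def kmer_support(sequences, k):
    """Map each k-mer to the number of input sequences that contain it."""
    support = defaultdict(int)
    for seq in sequences:
        # A set, so each k-mer is counted at most once per sequence.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            support[kmer] += 1
    return support

def most_shared_kmers(sequences, k, top=3):
    """k-mers sorted by decreasing support, ties broken alphabetically."""
    support = kmer_support(sequences, k)
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))[:top]

toy_upstream = ["ACGTGTGAC", "TTTGTGAGG", "CATGTGAAA"]
print(most_shared_kmers(toy_upstream, k=5, top=1))
```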

MEME

"We need a name for the new replicator, a noun that conveys the idea of a unit of cultural transmission, or a unit of imitation. 'Mimeme' comes from a suitable Greek root, but I want a monosyllable that sounds a bit like 'gene'. I hope my classicist friends will forgive me, if I abbreviate mimeme to meme..."
Richard Dawkins

• An (unsupervised) machine learning approach to motif discovery.

• Input:
– Set of unaligned sequences.
– Possible width of motifs.

• Output:
– A set of gapless motifs.
– Classifier for each motif.
– Alignment of the occurrences of the motif to the input set.

Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

MEME: Expectation Maximization

• Goal: Find motif profile and positions that have maximum likelihood

• Starting from a random motif, iteratively re-estimates a probabilistic motif model so that the motif is statistically overrepresented in the dataset. Converges to a local optimum.
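The EM idea can be illustrated with a deliberately simplified model: exactly one motif occurrence per sequence, a fixed uniform background, and a single starting guess. This is a sketch of the general technique, not MEME's actual implementation; the function `em_motif`, its `seed_site` parameter and all constants are hypothetical.

```python
BASES = "ACGT"

def em_motif(sequences, width, seed_site, iterations=50, pseudocount=0.1):
    """EM for a one-occurrence-per-sequence motif model (simplified sketch)."""
    if iterations < 1 or len(seed_site) != width:
        raise ValueError("need at least one iteration and a seed of length width")
    background = 0.25  # assumed uniform background
    # Initial profile: favour the seed site's base at each position.
    profile = [{b: (0.7 if b == s else 0.1) for b in BASES} for s in seed_site]
    for _ in range(iterations):
        # E-step: posterior probability of each start position in each sequence.
        posteriors = []
        for seq in sequences:
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= profile[j][seq[start + j]] / background
                weights.append(w)
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: re-estimate the profile from posterior-weighted base counts.
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq, post in zip(sequences, posteriors):
            for start, p in enumerate(post):
                for j in range(width):
                    counts[j][seq[start + j]] += p
        profile = [{b: c[b] / sum(c.values()) for b in BASES} for c in counts]
    # Most probable start position in each sequence under the final posteriors.
    best_starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return profile, best_starts

toy = ["ACGTTGTGACC", "GGTGTGATTTA", "CCATATGTGAG"]
profile, starts = em_motif(toy, width=5, seed_site="TGTGA")
print(starts)
```

Because EM only reaches a local optimum, the result depends on the starting guess; in practice several starting points are tried and the best-scoring model is kept.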

MEME result example

MEME Pros and cons

• The number of motifs or their occurrences are not required in the input.

• Only allows exact matches.

• High time complexity.

• Very pessimistic, can miss signals.

• DRIM is a tool for discovering short motifs in a ranked list of nucleic acid sequences.

• From a mathematical point of view, DRIM identifies subsequences that tend to appear at the top of the list more often than in the rest of the list.
– The definition of TOP in this context is flexible and driven by the data.

E. Eden, D. Lipson, S. Yogev & Z. Yakhini. Discovering Motifs in Ranked Lists of DNA Sequences, PLoS Computational Biology, 2007.

The HyperGeometric (HG) score

• The HG score estimates the significance of the intersection (of size b)

[Figure: ranked list of N genes, marking the B motif-containing genes, the top n genes and their intersection b]

N: all genes, ranked according to some criterion
B of them contain the motif
n of them are located at the top of the list
b contain the motif and are located at the top of the list
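Assuming the HG score is the hypergeometric tail probability (the chance of seeing b or more motif-containing genes among the top n when B of the N genes contain the motif), it can be computed directly. The function name `hg_tail` and the small example numbers are assumptions made for illustration.

```python
from math import comb

def hg_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N genes, B with motif, n drawn)."""
    upper = min(n, B)
    favourable = sum(comb(n, i) * comb(N - n, B - i) for i in range(b, upper + 1))
    return favourable / comb(N, B)

# 10 genes, 3 contain the motif, and all 3 are among the top 3.
print(hg_tail(N=10, B=3, n=3, b=3))
```

The smaller the value, the more surprising the overlap between the top of the list and the motif-containing genes.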

The mHG score

• DRIM checks all the possibilities for n, in order to optimize the significance of the intersection.
– It chooses the n_i which has the minimal HG score, denoted as the mHG score.

[Figure: ranked list of N genes, marking B, n_i and b_i]

The mHG score reflects the surprise of seeing the observed density of motif occurrences at the top of the list compared with the rest of the list.

(STILL NEEDS TO BE CORRECTED FOR MULTIPLE HYPOTHESIS)
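The search over all cut-offs n can be sketched as follows, again assuming the HG score is the hypergeometric tail probability. The input is a list of 0/1 flags in ranked order (1 means the sequence contains the motif); the function names and the toy list are hypothetical. The returned minimum is the raw mHG score and is not corrected for having tried every cut-off.

```python
from math import comb

def hg_tail(N, B, n, b):
    """P(X >= b) for X ~ Hypergeometric(N, B, n)."""
    upper = min(n, B)
    return sum(comb(n, i) * comb(N - n, B - i) for i in range(b, upper + 1)) / comb(N, B)

def mhg(occurrence):
    """Return (minimal HG score, cut-off n that achieves it), uncorrected."""
    N, B = len(occurrence), sum(occurrence)
    best_p, best_n, b = 1.0, 0, 0
    for n, flag in enumerate(occurrence, start=1):
        b += flag  # motif occurrences among the top n
        p = hg_tail(N, B, n, b)
        if p < best_p:
            best_p, best_n = p, n
    return best_p, best_n

ranked_flags = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(mhg(ranked_flags))
```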

Puf2 – an RNA binding protein

Yeast 3’UTR sequences were ranked according to Puf2 binding affinity.

>YDR222W, affinity = 5.962
ACAAAAGCGUGAACACUUCCACAUGAAAUUCGUUUUUUGUCCUUUUUUUUCUCUUCUUUUUCUCUCCUGUUUCU
>YLR297W, affinity = 5.937
AAUAAAAAUAGAUAUAAUAGAUGGCACCGCUCUUCACGCCCGAAAGUUGGACAUUUUAAAUUUUAAUUCUCAUGA
>YOL109W, affinity = 5.763
UCACACUUGAAUGUGCUGCACUUUACUAGAAGUUUCUUUUUCUUUUUUUAAAAAUAAAAAAAGAGGAGAAAAAUGC
>YGR138C, affinity = 5.498
GCUGGUGCAAGUUUCCGGUAAAAAUAAUGAUGUUCUAGUCAUUCAUAUAUACGAUACAAAAAUAACA
>YGL035C, affinity = 5.091
UACGCUGACAAGUUUUUGGCGGUGCAGAUAAAUCAAAAGACAAUAGACAAGAAUUAAUAAUAUUAACAAUUAA
...

DRIM

(mHG p-value = 9.9∙10^-49)

DRIM pros and cons

• Finds relations between ranking variable and motifs (enrichment).

• Returns best possible match without the need of a significance threshold.

• Impossible to build a dictionary for motifs of > ~10-mers.

Tools on the Web

• MEME – Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme
– metaMEME – Uses HMM method
– MAST – Motif Alignment and Search Tool
– Etc…

• TRANSFAC - database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/

• eMotif - lets you scan, make and search for motifs at the protein level. http://motif.stanford.edu/emotif/

• DRIM – Finds short motifs enriched in ranked lists. http://bioinfo.cs.technion.ac.il/drim/