Sequence Motifs

Post on 21-Dec-2015


Page 1: Sequence Motifs. Motifs Motifs represent a short common sequence –Regulatory motifs (TF binding sites) –Functional site in proteins (DNA binding motif)

Sequence Motifs


Motifs

• Motifs represent a short common sequence
  – Regulatory motifs (TF binding sites)
  – Functional site in proteins (DNA binding motif)


Regulatory Motifs

• Transcription Factors bind to regulatory motifs
  – Motifs are 6 – 20 nucleotides long
  – Activators and repressors
  – Usually located near target gene, mostly upstream

[Figure: Gene X and its Transcription Start Site, with the SBF motif and the MCM1 motif bound by SBF and MCM1]


E. coli promoter sequences


DNA binding Motif Zn finger C2H2


Challenges

• How to recognize a regulatory motif?

• Can we identify new occurrences of known motifs in genome sequences?

• Can we discover new motifs within upstream sequences of genes?


1. Motif Representation

• Exact motif: CGGATATA
• Consensus: represent only deterministic nucleotides.
  – Example: HAP1 binding sites in 5 sequences.

    CGGATATACCGG
    CGGTGATAGCGG
    CGGTACTAACGG
    CGGCGGTAACGG
    CGGCCCTAACGG
    ------------
    CGGNNNTANCGG

  – Consensus motif: CGGNNNTANCGG
  – N stands for any nucleotide.
• Representing only consensus loses information. How can this be avoided?
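The consensus for the HAP1 example can be derived column by column. The following is a minimal sketch, assuming the sites are already aligned and of equal length; the function name `consensus` and the variable `hap1_sites` are illustrative and do not come from the slides.

```python
def consensus(sites):
    """Return the shared base where every site agrees, and 'N' elsewhere."""
    if len({len(s) for s in sites}) != 1:
        raise ValueError("all sites must have the same length")
    # zip(*sites) walks the alignment one column at a time.
    return "".join(
        column[0] if len(set(column)) == 1 else "N"
        for column in zip(*sites)
    )

# The five HAP1 binding sites from the slide.
hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

print(consensus(hap1_sites))  # CGGNNNTANCGG
```

Every variable column collapses to N regardless of how skewed its base counts are, which is the information loss the slide points out.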


Representing the motif as a profile

Promoter layout: -35 element (TTGACA), then -10 element (TATAAT), then the transcription start site. Based on ~450 known promoters.

-35 element:
     1    2    3    4    5    6
A  0.1  0.1  0.1  0.5  0.2  0.5
T  0.7  0.7  0.2  0.2  0.2  0.2
G  0.1  0.1  0.5  0.1  0.1  0.2
C  0.1  0.1  0.2  0.2  0.5  0.1

-10 element:
     1    2    3    4    5    6
A  0.1  0.7  0.2  0.6  0.5  0.1
T  0.7  0.1  0.5  0.2  0.2  0.8
G  0.1  0.1  0.1  0.1  0.1  0.0
C  0.1  0.1  0.2  0.1  0.1  0.1


PSPM – Position Specific Probability Matrix

• Represents a motif of length k (5)
• Count the number of occurrences of each nucleotide in each position

    1   2   3   4   5
A  10  25   5  70  60
C  30  25  80  10  15
T  50  25   5  10   5
G  10  25  10  10  20


PSPM – Position Specific Probability Matrix

     1     2     3     4     5
A  0.1   0.25  0.05  0.7   0.6
C  0.3   0.25  0.8   0.1   0.15
T  0.5   0.25  0.05  0.1   0.05
G  0.1   0.25  0.1   0.1   0.2

• Defines Pi(n) for n ∈ {A,C,G,T} and i = 1,..,k.
  – Pi(A) – frequency of nucleotide A in position i.
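Turning the count matrix into the PSPM is a per-column normalization. This is a small sketch using the counts shown on the previous slide; the dictionary layout and the name `counts_to_pspm` are assumptions made for illustration.

```python
# Count matrix from the slide: one row per nucleotide, one entry per position.
counts = {
    "A": [10, 25, 5, 70, 60],
    "C": [30, 25, 80, 10, 15],
    "T": [50, 25, 5, 10, 5],
    "G": [10, 25, 10, 10, 20],
}

def counts_to_pspm(counts):
    """Divide each count by its column total to obtain Pi(n)."""
    k = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(k)]
    return {n: [row[i] / totals[i] for i in range(k)] for n, row in counts.items()}

pspm = counts_to_pspm(counts)
for nucleotide, row in pspm.items():
    print(nucleotide, row)
```

Each column of the result sums to 1, so Pi can be read as a probability distribution over the four nucleotides at position i.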


Graphical Representation – Sequence Logo

• Horizontal axis: position of the base in the sequence.

• Vertical axis: amount of information.

• Letter stack: order indicates importance.

• Letter height: indicates frequency.

• Consensus can be read across the top of the letter columns.
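The slides do not give the formula behind the vertical axis. The sketch below assumes the commonly used definition for DNA logos: the information content of a column is 2 bits minus its Shannon entropy (without any small-sample correction), and each letter's height is its frequency times that information content. The function names are illustrative.

```python
import math

def column_information(probs):
    """Information content in bits of one DNA column: 2 - entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy

def letter_heights(probs):
    """Letter heights for one logo column, tallest letter first."""
    info = column_information(probs)
    heights = {n: p * info for n, p in probs.items()}
    return sorted(heights.items(), key=lambda item: item[1], reverse=True)

# Position 3 of the example PSPM, and a fully uninformative column.
position3 = {"A": 0.05, "C": 0.8, "T": 0.05, "G": 0.1}
uniform = {"A": 0.25, "C": 0.25, "T": 0.25, "G": 0.25}

print(column_information(uniform))  # 0.0
print(letter_heights(position3))
```

Under this definition a uniform column has no height at all, while a fully conserved column reaches the maximum of 2 bits, which is why conserved positions dominate a logo visually.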


Identification of Known Motifs within Genomic Sequences

• Motivation:
  – Identification of new genes controlled by the same TF.
  – Infer the function of these genes.
  – Enable better understanding of the regulation mechanism.


PSPM – Position Specific Probability Matrix

     1     2     3     4     5
A  0.1   0.25  0.05  0.7   0.6
C  0.3   0.25  0.8   0.1   0.15
T  0.5   0.25  0.05  0.1   0.05
G  0.1   0.25  0.1   0.1   0.2

• Each k-mer is assigned a probability.
  – Example: P(TCCAG) = 0.5*0.25*0.8*0.7*0.2 = 0.014


Detecting a Known Motif within a Sequence using PSPM

• The PSPM is moved along the query sequence.
• At each position the sub-sequence is scored for a match to the PSPM.
• Example: sequence = ATGCAAGTCT…
  – Position 1: ATGCA  0.1*0.25*0.1*0.1*0.6 = 1.5*10^-4
  – Position 2: TGCAA  0.5*0.25*0.8*0.7*0.6 = 0.042

     1     2     3     4     5
A  0.1   0.25  0.05  0.7   0.6
C  0.3   0.25  0.8   0.1   0.15
T  0.5   0.25  0.05  0.1   0.05
G  0.1   0.25  0.1   0.1   0.2
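The sliding-window scan can be sketched as follows. The code uses only the ten bases of the example sequence that are shown on the slide (the rest is elided there), and the names `kmer_probability` and `scan` are hypothetical helpers, not part of any tool mentioned in this deck.

```python
PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def kmer_probability(kmer, pspm):
    """Multiply the per-position probabilities of the bases in kmer."""
    p = 1.0
    for i, nucleotide in enumerate(kmer):
        p *= pspm[nucleotide][i]
    return p

def scan(sequence, pspm):
    """Score every window of motif length; positions are 1-based as on the slide."""
    k = len(next(iter(pspm.values())))
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        hits.append((start + 1, window, kmer_probability(window, pspm)))
    return hits

for position, window, p in scan("ATGCAAGTCT", PSPM):
    print(position, window, f"{p:.2e}")
```

The first two windows reproduce the slide's values of 1.5*10^-4 for ATGCA and 0.042 for TGCAA.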


Detecting a Known Motif within a Sequence using PSSM

Is it a random match, or is it indeed an occurrence of the motif?

PSPM -> PSSM (Position Specific Scoring Matrix)
  – Odds score matrix: Oi(n), where n ∈ {A,C,G,T}, for i = 1,..,k
  – Defined as Pi(n)/P(n), where P(n) is the background frequency of nucleotide n.

Oi(n) increases => higher odds that n at position i is part of a real motif.


PSSM as Odds Score Matrix

• Assumption: the background frequency of each nucleotide is 0.25.

1. Original PSPM (Pi):

          1       2       3       4       5
   A    0.1     0.25    0.05    0.7     0.6

2. Odds Matrix (Oi):

          1       2       3       4       5
   A    0.4     1       0.2     2.8     2.4

3. Going to log scale we get an additive score, Log odds Matrix (log2 Oi):

          1       2       3       4       5
   A   -1.322   0      -2.322   1.485   1.263
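The three steps (probability, odds, log odds) can be written as one conversion. This sketch assumes the uniform background of 0.25 stated on the slide; mapping a zero probability to negative infinity is an assumption added here, since the slide's example contains no zeros.

```python
import math

PSPM = {
    "A": [0.1, 0.25, 0.05, 0.7, 0.6],
    "C": [0.3, 0.25, 0.8, 0.1, 0.15],
    "T": [0.5, 0.25, 0.05, 0.1, 0.05],
    "G": [0.1, 0.25, 0.1, 0.1, 0.2],
}

def log_odds_matrix(pspm, background=0.25):
    """Compute log2(Pi(n) / P(n)) for every nucleotide and position."""
    return {
        n: [math.log2(p / background) if p > 0 else float("-inf") for p in row]
        for n, row in pspm.items()
    }

pssm = log_odds_matrix(PSPM)
print([round(score, 3) for score in pssm["A"]])  # [-1.322, 0.0, -2.322, 1.485, 1.263]
```

Because the scores are logarithms, the product of odds over a window becomes a simple sum of matrix entries.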


Calculating using Log Odds Matrix

       1      2      3      4      5
A  -1.32   0     -2.32   1.48   1.26
C   0.26   0      1.68  -1.32  -0.74
T   1      0     -2.32  -1.32  -2.32
G  -1.32   0     -1.32  -1.32  -0.32

• Log odds = 0 implies random match; log odds > 0 implies real match (?).
• Example: sequence = ATGCAAGTCT…
  – Position 1: ATGCA  -1.32+0-1.32-1.32+1.26 = -2.7, odds = 2^-2.7 = 0.15
  – Position 2: TGCAA  1+0+1.68+1.48+1.26 = 5.42, odds = 2^5.42 = 42.8
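With the log odds matrix, scoring a window is a sum rather than a product. The sketch below uses the rounded matrix values from the slide and the ten visible bases of the example sequence; `log_odds_score` and `best_window` are illustrative names.

```python
# Rounded log odds matrix as shown on the slide.
LOG_ODDS = {
    "A": [-1.32, 0, -2.32, 1.48, 1.26],
    "C": [0.26, 0, 1.68, -1.32, -0.74],
    "T": [1, 0, -2.32, -1.32, -2.32],
    "G": [-1.32, 0, -1.32, -1.32, -0.32],
}

def log_odds_score(kmer, matrix):
    """Additive score of a k-mer: sum of the per-position log odds."""
    return sum(matrix[nucleotide][i] for i, nucleotide in enumerate(kmer))

def best_window(sequence, matrix):
    """Return (1-based position, k-mer, score) of the highest scoring window."""
    k = len(matrix["A"])
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    scores = [log_odds_score(w, matrix) for w in windows]
    best = max(range(len(windows)), key=scores.__getitem__)
    return best + 1, windows[best], scores[best]

position, kmer, score = best_window("ATGCAAGTCT", LOG_ODDS)
print(position, kmer, round(score, 2), round(2 ** score, 1))
```

Raising 2 to the score converts it back to an odds ratio, which is how the slide obtains 0.15 and 42.8 for the first two windows.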


Calculating the probability of a match

ATGCAAG

• Position 1 ATGCA = 0.15
• Position 2 TGCAA = 42.8
• Position 3 GCAAG = 0.18

P(i) = S(i) / ∑ S(j), where S(i) is the odds score at position i.
Example: P(1) = 0.15 / (0.15 + 42.8 + 0.18) = 0.003

P(1) = 0.003
P(2) = 0.992
P(3) = 0.004
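The normalization P(i) = S(i) / ∑ S(j) is short enough to check directly. This sketch takes the three odds scores from the example as given; reading the result as a probability implicitly assumes that exactly one of the candidate positions holds the motif and that all positions are equally likely beforehand, which the slide does not state explicitly.

```python
def match_probabilities(odds_scores):
    """Normalize odds scores so that they sum to 1 across candidate positions."""
    total = sum(odds_scores)
    return [score / total for score in odds_scores]

# Odds scores for positions 1, 2 and 3 of ATGCAAG, as computed earlier.
odds = [0.15, 42.8, 0.18]
probabilities = match_probabilities(odds)

for position, p in enumerate(probabilities, start=1):
    print(f"P({position}) = {p:.3f}")
```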


Building a PSSM

• Collect all known sequences that bind a certain TF.

• Align all sequences (using multiple sequence alignment).

• Compute the frequency of each nucleotide in each position (PSPM).

• Incorporate background frequency for each nucleotide (PSSM).
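The last two steps of this recipe can be sketched end to end for sites that are already aligned. Two things here are assumptions that do not come from the slides: the alignment is taken to be ungapped, and a pseudocount (0.5 per nucleotide by default) is added so that bases never observed at a position get a large negative score instead of an undefined logarithm. The function name `build_pssm` is illustrative.

```python
import math
from collections import Counter

def build_pssm(aligned_sites, background=None, pseudocount=0.5):
    """Return one {nucleotide: log2 odds} dict per alignment column."""
    alphabet = "ACGT"
    background = background or {n: 0.25 for n in alphabet}
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("aligned sites must all have the same length")
    pssm = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            # Frequency with pseudocount (PSPM), divided by background (PSSM).
            n: math.log2(((counts[n] + pseudocount) / total) / background[n])
            for n in alphabet
        })
    return pssm

# Reusing the five HAP1 sites from the consensus example as input.
hap1_sites = [
    "CGGATATACCGG",
    "CGGTGATAGCGG",
    "CGGTACTAACGG",
    "CGGCGGTAACGG",
    "CGGCCCTAACGG",
]

for position, scores in enumerate(build_pssm(hap1_sites), start=1):
    print(position, {n: round(s, 2) for n, s in scores.items()})
```

The choice of pseudocount controls how strongly unseen bases are penalized; a smaller value pushes those entries toward very negative scores.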


Finding new Motifs

• We are given a group of genes, which presumably contain a common regulatory motif.

• We know nothing of the TF that binds to the putative motif.

• The problem: discover the motif.


Example

Predicting the cAMP Receptor Protein (CRP) binding site motif


Extract experimentally defined CRP Binding Sites

GGATAACAATTTCACAAGTGTGTGAGCGGATAACAAAAGGTGTGAGTTAGCTCACTCCCCTGTGATCTCTGTTACATAGACGTGCGAGGATGAGAACACAATGTGTGTGCTCGGTTTAGTTCACCTGTGACACAGTGCAAACGCGCCTGACGGAGTTCACAAATTGTGAGTGTCTATAATCACGATCGATTTGGAATATCCATCACATGCAAAGGACGTCACGATTTGGGAGCTGGCGACCTGGGTCATGTGTGATGTGTATCGAACCGTGTATTTATTTGAACCACATCGCAGGTGAGAGCCATCACAGGAGTGTGTAAGCTGTGCCACGTTTATTCCATGTCACGAGTGTTGTTATACACATCACTAGTGAAACGTGCTCCCACTCGCATGTGATTCGATTCACA


Create a Multiple Sequence Alignment

GGATAACAATTTCACA
TGTGAGCGGATAACAA
TGTGAGTTAGCTCACT
TGTGATCTCTGTTACA
CGAGGATGAGAACACA
CTCGGTTTAGTTCACC
TGTGACACAGTGCAAA
CCTGACGGAGTTCACA
AGTGTCTATAATCACG
TGGAATATCCATCACA
TGCAAAGGACGTCACG
GGCGACCTGGGTCATG
TGTGATGTGTATCGAA
TTTGAACCACATCGCA
GGTGAGAGCCATCACA
TGTAAGCTGTGCCACG
TTTATTCCATGTCACG
TGTTATACACATCACT
CGTGCTCCCACTCGCA
TGTGATTCGATTCACA


Generate a PSSM

          A       C       G       T
 1    -0.43    0.1    -0.46    0.55
 2     1.37    0.12   -1.59  -11.2
 3     1.69   -1.28  -11.2    -1.43
 4    -1.28    0.12  -11.2     1.32
 5     0.91  -11.2    -0.46    0.47
 6     1.53   -1.38   -1.48   -1.43
 7     0.9    -0.48  -11.2     0.12
 8    -1.37   -1.28  -11.2     1.68
 9   -11.2   -11.2     1.73   -0.56
10   -11.2    -0.51  -11.2     1.72
11    -0.48  -11.2     1.72  -11.2
12     1.56   -1.59  -11.2    -0.46
13    -0.51   -0.38   -0.55    0.88
14   -11.2     0.5     0.57    0.13
15     0.17   -0.51    0.12    0.12
16     0.9   -11.2     0.5    -0.48
17     0.17    0.16    0.06   -0.48
18    -0.4    -0.38    0.82   -0.48
19    -1.38   -1.28  -11.2     1.68
20    -1.48    1.7   -11.2    -1.38
21     1.5    -1.38   -1.43   -1.28


XXXXXTGTGAXXXXAXTCACAXXXXXXXXXXXXACACTXXXXTXGATGTXXXXXXX


PROBLEMS…

• When searching for a motif in a genome using PSSM or other methods – the motif is usually found all over the place

->The motif is considered real if found in the vicinity of a gene.

• Checking experimentally for the binding sites of a specific TF (location analysis) – the sites that bind the motif are in some cases similar to the PSSM and sometimes not!


Computational Methods

• This problem has received a lot of attention from CS people.
• Methods include:

– Probabilistic methods – hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.

– Enumeration methods – problematic for inexact motifs of length k>10. …

• Current status: Problem is still open.
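To make the enumeration idea concrete, here is a deliberately simple sketch that counts exact k-mers shared across a set of upstream sequences. It is not the method of any tool named in this deck: the function names and the three toy sequences are invented for illustration, and only exact matches are considered.

```python
from collections import Counter

def enumerate_kmers(sequences, k):
    """For every k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most once per k-mer.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support

def shared_kmers(sequences, k, min_fraction=1.0):
    """Return the k-mers present in at least min_fraction of the sequences."""
    support = enumerate_kmers(sequences, k)
    threshold = min_fraction * len(sequences)
    return sorted(kmer for kmer, n in support.items() if n >= threshold)

# Toy upstream regions (made up), each containing TGTGA.
upstream = ["AATGTGACC", "GGTGTGATT", "CTGTGAAAC"]
print(shared_kmers(upstream, 5))  # ['TGTGA']
```

Exact counting like this scales with the input size, but allowing mismatches means considering up to 4^k candidate patterns (over a million for k = 10), which is one way to see why enumeration becomes problematic for longer inexact motifs.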


Tools on the Web

• MEME – Multiple EM for Motif Elicitation. http://meme.sdsc.edu/meme/website/

• metaMEME – Uses HMM method. http://meme.sdsc.edu/meme

• MAST – Motif Alignment and Search Tool. http://meme.sdsc.edu/meme

• TRANSFAC - database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. http://transfac.gbf.de/TRANSFAC/

• eMotif - scans, builds and searches for motifs at the protein level. http://motif.stanford.edu/emotif/