Consensus Sequences: Just Say No!

Thomas D. Schneider
Laboratory of Mathematical Biology
National Cancer Institute
Frederick, MD 21702
[email protected]
http://www-lmmb.ncifcrf.gov/~toms/

1995 December 4

CONSENSUS SEQUENCES

Summary

Consensus sequences are being used to characterize the binding sites of macromolecules on DNA and RNA. After aligning a set of binding site sequences, the most frequent base at each position is chosen. A position which contains 100% A's will be represented by an A, while a position that is only 75% A will also be represented by an A. The consensus is frequently used to search for binding sites, and the number of mismatches to the consensus is counted. A mismatch to a 100% A position is much more severe than one to a 75% A position, but this is not accounted for, so the researcher is misled. We present mathematically robust graphical replacements for the consensus sequence, called the sequence logo and the walker, that do not discard your hard-earned data. Further information and examples may be found on the internet at http://www-lmmb.ncifcrf.gov/~toms/.

Characterize what a binding site looks like
❄ Use Sequence Logos instead

Search for new sites
❄ Use Rindividual Matrix Scans instead

Investigate how well the bases of a sequence match to functional binding sites
❄ Use the Walker instead
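The problem described in the summary can be seen in a toy example. The following is a minimal sketch in Python, not code from the poster: the four two-base sites and the helper names `consensus` and `mismatches` are made up for illustration. Column 1 is 100% A and column 2 is 75% A, yet both are reported as A, and a mismatch at either position counts the same.

```python
from collections import Counter


def consensus(sites):
    """Most frequent base in each column of an alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sites))


def mismatches(sequence, consensus_sequence):
    """Number of positions at which a sequence differs from the consensus."""
    return sum(a != b for a, b in zip(sequence, consensus_sequence))


# Hypothetical toy alignment: column 1 is 100% A, column 2 is 75% A.
toy_sites = ["AA", "AA", "AA", "AC"]
toy_consensus = consensus(toy_sites)

# "AC" is one of the aligned sites; "CA" has a base never seen in column 1.
# Counting mismatches cannot tell the two apart.
print(toy_consensus, mismatches("AC", toy_consensus), mismatches("CA", toy_consensus))
```

Both candidate sequences get one mismatch, although only one of them resembles the aligned sites. That lost distinction is what the logo, the scan and the walker are meant to keep.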


❄ SEQUENCE LOGO

         --------- +++++++++
         9876543210123456789
         ...................
     1   GTATCACCGCCAGTGGTAT
     2   ATACCACTGGCGGTGATAC
     3   TCAACACCGCCAGAGATAA
     4   TTATCTCTGGCGGTGTTGA
     5   TTATCACCGCAGATGGTTA
     6   TAACCATCTGCGGTGATAA
     7   CTATCACCGCAAGGGATAA
     8   TTATCCCTTGCGGTGATAG
     9   CTAACACCGTGCGTGTTGA
    10   TCAACACGCACGGTGTTAG
    11   TTACCTCTGGCGGTGATAA
    12   TTATCACCGCCAGAGGTAA

12 Lambda cI and cro binding sites

[Figure: sequence logo of the 12 sites. Horizontal axis: positions -9 to +9; vertical axis: 0 to 2 bits.]

Annotations on the figure:
❄ Raw sequences of binding sites (the alignment)
❄ The height of each stack of letters is the information content (sequence conservation) in bits per base
❄ Letter heights are proportional to frequency: 8/12 = 67% A, 2/12 = 17% G, 1/12 = 8% C, 1/12 = 8% T
❄ Minor Groove Faces Protein
❄ Major Groove Faces Protein
❄ Zero coordinate for alignment
❄ Error bar for entire stack of letters
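The stack-height and letter-height rules can be sketched for the position quoted in the annotations (8 A, 2 G, 1 C, 1 T out of 12). This is a minimal Python sketch under stated assumptions: the function names are hypothetical, and the small-sample correction and error bars used for real logos are left out, so the stack height printed here is the uncorrected value.

```python
import math


def stack_height_bits(counts):
    """Uncorrected information content of one position: 2 - H, in bits."""
    n = sum(counts.values())
    uncertainty = -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)
    return 2.0 - uncertainty


def letter_heights(counts):
    """Split the stack height among the letters in proportion to frequency."""
    n = sum(counts.values())
    total = stack_height_bits(counts)
    return {base: (c / n) * total for base, c in counts.items()}


# Base counts from the annotation: 8/12 A, 2/12 G, 1/12 C, 1/12 T.
counts = {"A": 8, "G": 2, "C": 1, "T": 1}
stack = stack_height_bits(counts)
heights = letter_heights(counts)
print(round(stack, 3), {base: round(h, 3) for base, h in heights.items()})
```

The tallest letter is the one a consensus would report, but the logo also shows how short the whole stack is at a poorly conserved position.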

How can two binding sites be different but have the same consensus sequence?

[Figure: sequence logos of splice donor and acceptor sites, placed on a diagram of a 5' exon, an intron and a 3' exon.]

These two sequence logos have the same consensus sequence (CA GGT) but different emphasis.


❄ SCAN

The individual information weight matrix is placed at every position of a sequence. At each position, the weights selected by the bases of the sequence are added together. This gives the total Rindividual (Ri) at every position in the sequence.

[Figure: Ri scan around the E. coli nrd promoter. Horizontal axis: position (bases) relative to the E. coli nrd promoter, -400 to 100; vertical axis: Ri (bits), -30 to 20. Marked features: promoter, DnaA sites, Fis sites predicted from consensus.]

Many Fis sites observed in the footprint match the scan.
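The scan described above can be sketched in a few lines of Python. This is an illustration only: the three-position weight matrix `TOY_WEIGHTS` and the sequence are hypothetical, not the Fis matrix, and the function name `scan` is made up here.

```python
# Hypothetical 3-position weight matrix in bits (illustrative values only).
TOY_WEIGHTS = [
    {"a": 1.5, "c": -2.0, "g": -1.0, "t": -3.0},
    {"a": -1.0, "c": 0.5, "g": 0.5, "t": -1.0},
    {"a": -3.0, "c": -1.0, "g": -2.0, "t": 1.5},
]


def scan(sequence, weights):
    """Return (start, Ri) for every placement of the matrix on the sequence."""
    width = len(weights)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Add the weight selected by the base found at each matrix position.
        ri = sum(weights[l][base] for l, base in enumerate(window))
        scores.append((start, ri))
    return scores


scores = scan("ggactaa", TOY_WEIGHTS)
best = max(scores, key=lambda item: item[1])
print(scores, best)
```

Plotting the second element of each pair against the first gives the kind of Ri-versus-position graph shown in the figure.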


Scan of Fis Promoter

6 Fis sites were predicted on the E. coli Fis promoter from footprints and consensus.

[Figure: Ri scan of the E. coli fis promoter, with the promoter marked. Horizontal axis: position (bases) in the E. coli fis promoter, 0 to 1400; vertical axis: Ri (bits), -30 to 15.]

Many potential Fis sites besides those already identified on the Fis promoter are observed. The large number near the promoter indicates that Fis controls its own transcription.


How well do the bases of a sequence match to functional binding sites?

❄ WALKER

[Figure: 5 copies of a single sequence, each with a walker placed at a different position.]

    Position   Ri (bits)   Z
    178          -5.3     -4.9
    179          -6.7     -5.4
    180           9.2      0.3
    181         -11.7     -7.2
    182          -3.9     -4.4

The Walker is the colored letters.

The weight matrix is for the E. coli Fis protein.

The vertical bar is a scale: the top is at 2 bits, the middle is at 0 bits, and the bottom is at -4 bits.

Letters that go up are preferred at that position.

Letters that go down are detrimental at that position (purple goes below -4 bits).

A green vertical bar indicates a binding site; a red one means it is probably not a site.

The three numbers on the bar are:
❄ position
❄ Rindividual
❄ standard deviations from mean (Z)

The cosine wave represents the orientation of the DNA facing the protein.
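The quantities shown by a walker can be sketched as follows. This is a hedged illustration, not the walker program itself: the weight matrix and the list of site Ri values are hypothetical, the function names are made up, and the use of the sample standard deviation for Z is an assumption.

```python
import statistics

# Hypothetical 3-position weight matrix in bits (illustrative values only).
TOY_WEIGHTS = [
    {"a": 1.5, "c": -2.0, "g": -1.0, "t": -3.0},
    {"a": -1.0, "c": 0.5, "g": 0.5, "t": -1.0},
    {"a": -3.0, "c": -1.0, "g": -2.0, "t": 1.5},
]


def walker(sequence, start, weights):
    """Letter heights, total Ri and bar colour for a walker placed at `start`."""
    window = sequence[start:start + len(weights)]
    # Each letter is drawn with the height of its own weight (up if positive).
    heights = [(base, weights[l][base]) for l, base in enumerate(window)]
    ri = sum(height for _, height in heights)
    colour = "green" if ri > 0 else "red"
    return heights, ri, colour


def z_score(ri, site_ri_values):
    """Standard deviations of `ri` from the mean Ri of known sites."""
    mean = statistics.mean(site_ri_values)
    sd = statistics.stdev(site_ri_values)  # sample standard deviation (assumption)
    return (ri - mean) / sd


heights, ri, colour = walker("ggactaa", 2, TOY_WEIGHTS)
z = z_score(ri, [2.0, 4.0, 6.0])  # hypothetical Ri values of known sites
print(heights, ri, colour, z)
```

Moving `start` by one base changes every letter height, which is why the five copies in the figure look so different from one another.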

Is a sequence change a Mutation or a Polymorphism?

A T to C change seen in a splice acceptor of hMSH2 was interpreted to be the mutation which causes familial nonpolyposis colon cancer (Fishel et al., Cell 75:1027-1038, 1993).

[Figure: sequence logo of splice acceptor sites. Horizontal axis: positions -25 to +2; vertical axis: 0 to 2 bits.]

The sequence logo shows nearly equal frequencies of bases there.

[Figure: walkers at position 0 on the wild-type sequence, on the sequence with the T to C change, and on a sequence showing what a strong mutation would look like.]

    Sequence                    Ri (bits)   Z
    wild type                      6.5     -0.5
    T to C change                  6.3     -0.6
    strong mutation (example)     -0.9     -2.1

The walker shows it is a polymorphism.
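The comparison can be restated as simple arithmetic on the Ri values read from the walkers. The sketch below is an illustration, not the author's procedure: the function name is hypothetical, and using zero as a hard dividing line between sites and non-sites simplifies the poster's statement that zero divides sites from non-sites.

```python
def ri_change(wild_type_ri, variant_ri):
    """Change in Ri (bits) and whether the variant still scores above zero."""
    return variant_ri - wild_type_ri, variant_ri > 0


# Ri values in bits as read from the walkers on the poster.
polymorphism = ri_change(6.5, 6.3)       # the T to C change
strong_mutation = ri_change(6.5, -0.9)   # the strong mutation example
print(polymorphism, strong_mutation)
```

The T to C change costs about 0.2 bits and leaves the acceptor well above zero, while the strong mutation example loses about 7.4 bits and falls below zero.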


[Figure: chemical structures of the A-T and G-C base pairs with atom labels for the sugar, phosphate and base atoms. The Minor Groove and Major Groove sides of each pair are marked, and the contact positions in each groove are labelled d and a.]

A BIT measures the choice between 2 equally likely possibilities:

To choose one base in 4 requires two bits:

one bit is like a knife slice

two bits are like two knife slices
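The knife-slice picture corresponds to taking a base-2 logarithm. A minimal Python sketch, with hypothetical function names, shows both the count of bits and how two yes/no choices pick one base out of four:

```python
import math


def bits_to_choose(equally_likely_possibilities):
    """Bits needed to pick one of n equally likely possibilities."""
    return math.log2(equally_likely_possibilities)


def pick_base(first_slice, second_slice, bases="acgt"):
    """Two binary choices (0 or 1 each) select one of the four bases."""
    return bases[2 * first_slice + second_slice]


print(bits_to_choose(2), bits_to_choose(4), pick_base(1, 0))
```

The order of the bases in `bases` is arbitrary; any fixed order lets two slices single out one base.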

INFORMATION THEORY


Information measured in bits
❄ is additive
❄ is well supported mathematically
❄ measures sequence conservation
❄ is related to the entropy
❄ is calculated as a decrease in the uncertainty H:

    $H = -\sum_i p_i \log_2 p_i$   (bits)

  as

    $R = -\Delta H$   (bits/symbol)

R is the Rate of information transmission.

Every sequence has an individual information

         --------- +++++++++
         9876543210123456789
         ...................
     1   gtatcaccgccagtggtat   17.7 bits
     2   ataccactggcggtgatac   17.7 bits
     3   tcaacaccgccagagataa   19.3 bits
     4   ttatctctggcggtgttga   19.3 bits
     5   ttatcaccgcagatggtta   15.7 bits
     6   taaccatctgcggtgataa   15.7 bits
     7   ctatcaccgcaagggataa   17.3 bits
     8   ttatcccttgcggtgatag   17.3 bits
     9   ctaacaccgtgcgtgttga   11.0 bits
    10   tcaacacgcacggtgttag   11.0 bits
    11   ttacctctggcggtgataa   21.5 bits
    12   ttatcaccgccagaggtaa   21.5 bits

    Σ = 205 bits
    Σ / 12 = 17.1 bits

The average of the individual information values for the original sequences is the same as the sequence conservation, which is the area under the sequence logo.
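The equality between the average individual information and the sequence conservation can be checked numerically on the 12 sites listed above. The sketch below is an illustration with hypothetical function names. It uses uncorrected frequencies (no small-sample correction), so the number it prints is not expected to equal the 17.1 bits quoted on the poster; the point is only that the two ways of computing it agree.

```python
import math
from collections import Counter

# The 12 Lambda cI and cro binding sites from the alignment.
SITES = [
    "gtatcaccgccagtggtat", "ataccactggcggtgatac",
    "tcaacaccgccagagataa", "ttatctctggcggtgttga",
    "ttatcaccgcagatggtta", "taaccatctgcggtgataa",
    "ctatcaccgcaagggataa", "ttatcccttgcggtgatag",
    "ctaacaccgtgcgtgttga", "tcaacacgcacggtgttag",
    "ttacctctggcggtgataa", "ttatcaccgccagaggtaa",
]


def frequency_columns(sites):
    """Base frequencies f(b, l) for every column l of the alignment."""
    n = len(sites)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*sites)]


def individual_information(site, freqs):
    """Uncorrected Ri of one site: sum over positions of 2 + log2 f(b, l)."""
    return sum(2.0 + math.log2(freqs[l][b]) for l, b in enumerate(site))


def r_sequence(freqs):
    """Uncorrected area under the logo: sum over positions of 2 - H(l)."""
    return sum(2.0 + sum(f * math.log2(f) for f in col.values()) for col in freqs)


freqs = frequency_columns(SITES)
ri_values = [individual_information(site, freqs) for site in SITES]
mean_ri = sum(ri_values) / len(ri_values)
print(round(mean_ri, 3), round(r_sequence(freqs), 3))
```

Because the set contains each site together with its reverse complement, paired sites such as 1 and 2 receive the same individual information, as in the table.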


Individual Information

[Figure: sequence logo of 92 FIS binding sites. Horizontal axis: positions -10 to +10; vertical axis: 0 to 2 bits.]

The area under a sequence logo is the total sequence conservation. The mathematics makes it look like an average.

QUESTION: What is it the average of?
ANSWER: The information of each individual binding site.

This is found from the individual information weight matrix as:

    $R_i(b,l) = 2 + \log_2 f(b,l)$

where f(b,l) is the frequency of each base b at every position l in the binding site.

Left columns: number of bases b at each position l. Right columns: Ri(b,l) weight matrix.

      l  n(a,l) n(c,l) n(g,l) n(t,l)    Ri(a,l)    Ri(c,l)    Ri(g,l)    Ri(t,l)
    -10    36      7     21     28     0.622843  -1.739727  -0.154765   0.260273
     -9    20     19     15     38    -0.225154  -0.299154  -0.640191   0.700846
     -8    18     24     11     39    -0.377157   0.037881  -1.087650   0.738320
     -7     3      0     85      4    -2.962119  -6.539159   1.862309  -2.547082
     -6    24     19     29     20     0.037881  -0.299154   0.310899  -0.225154
     -5    20     18     15     39    -0.225154  -0.377157  -0.640191   0.738320
     -4     5     40     12     35    -2.225154   0.774846  -0.962119   0.582201
     -3    73      2     15      2     1.642743  -3.547082  -0.640191  -3.547082
     -2    53      9     10     20     1.180838  -1.377157  -1.225154  -0.225154
     -1    40      8     11     33     0.774846  -1.547082  -1.087650   0.497312
      0    42      4      4     42     0.845235  -2.547082  -2.547082   0.845235
      1    33     11      8     40     0.497312  -1.087650  -1.547082   0.774846
      2    20     10      9     53    -0.225154  -1.225154  -1.377157   1.180838
      3     2     15      2     73    -3.547082  -0.640191  -3.547082   1.642743
      4    35     12     40      5     0.582201  -0.962119   0.774846  -2.225154
      5    39     15     18     20     0.738320  -0.640191  -0.377157  -0.225154
      6    20     29     19     24    -0.225154   0.310899  -0.299154   0.037881
      7     4     85      0      3    -2.547082   1.862309  -6.539159  -2.962119
      8    39     11     24     18     0.738320  -1.087650   0.037881  -0.377157
      9    38     15     19     20     0.700846  -0.640191  -0.299154  -0.225154
     10    28     21      7     36     0.260273  -0.154765  -1.739727   0.622843
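One row of the weight matrix can be recomputed from its base counts as a check. The sketch below rests on an inference, not on anything stated in the poster: the tabulated values are matched to about five decimal places only if a small-sample correction of roughly (4 - 1) / (2 ln 2 · n) bits is subtracted from 2 + log2 f(b,l). That correction term, and the function names, are assumptions. Bases with a count of zero are skipped because the sketch does not know how the poster assigns their weights.

```python
import math


def small_sample_correction(n, symbols=4):
    """Approximate correction in bits; an assumption inferred from the table."""
    return (symbols - 1) / (2.0 * math.log(2) * n)


def ri_weights(counts):
    """Ri(b, l) = 2 + log2 f(b, l) minus the assumed correction, per base."""
    n = sum(counts.values())
    correction = small_sample_correction(n)
    return {
        base: 2.0 + math.log2(c / n) - correction
        for base, c in counts.items()
        if c > 0  # zero counts need special handling that is not modelled here
    }


# Counts for position -10 of the 92 FIS binding sites, copied from the table.
row_minus_10 = {"a": 36, "c": 7, "g": 21, "t": 28}
weights = ri_weights(row_minus_10)
print({base: round(w, 6) for base, w in weights.items()})
```

For 92 sites the assumed correction is only about 0.02 bits per position, so the uncorrected formula gives nearly the same weights.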

Individual Information Distributions

[Figure: distributions of Ri for sites and non-sites. Horizontal axis: Ri; vertical axis: number of sites. Marked on the plots: zero, Rsequence = mean, SEM, consensus, anti-consensus.]

The consensus is an extremely abnormal, strong binding site with probability typically < 10^-5.

Zero divides sites from non-sites.

The average of the individual information values ... is the area under the sequence logo.
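One way to get a feel for how unusual the consensus is: treat the positions as independent and multiply the frequencies of the most common base at each position of the 92 FIS sites. This is an assumption about what "probability" means here, made for illustration only; the poster may have a different calculation in mind. The function name is hypothetical and the counts are copied from the weight matrix table.

```python
# Base counts n(b, l) for positions -10 .. +10, in the order a, c, g, t,
# copied from the table of 92 FIS binding sites.
COUNTS = [
    (36, 7, 21, 28), (20, 19, 15, 38), (18, 24, 11, 39), (3, 0, 85, 4),
    (24, 19, 29, 20), (20, 18, 15, 39), (5, 40, 12, 35), (73, 2, 15, 2),
    (53, 9, 10, 20), (40, 8, 11, 33), (42, 4, 4, 42), (33, 11, 8, 40),
    (20, 10, 9, 53), (2, 15, 2, 73), (35, 12, 40, 5), (39, 15, 18, 20),
    (20, 29, 19, 24), (4, 85, 0, 3), (39, 11, 24, 18), (38, 15, 19, 20),
    (28, 21, 7, 36),
]


def consensus_probability(counts):
    """Product over positions of the frequency of the most common base."""
    probability = 1.0
    for row in counts:
        probability *= max(row) / sum(row)
    return probability


p_consensus = consensus_probability(COUNTS)
print(f"{p_consensus:.1e}")
```

Under this independence assumption the result is below 10^-5, in line with the statement above, even though every single position of the consensus shows the most common base.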
