Consensus Sequences: Just Say No!
Thomas D. Schneider
Laboratory of Mathematical Biology, National Cancer Institute, Frederick, MD 21702
[email protected]
http://www-lmmb.ncifcrf.gov/~toms/
1995 December 4
CONSENSUS SEQUENCES
Summary: Consensus sequences are widely used to characterize the binding sites of macromolecules on DNA and RNA. After a set of binding site sequences is aligned, the most frequent base at each position is chosen. A position that contains 100% A's is represented by an A, but a position that is only 75% A is also represented by an A. The consensus is frequently used to search for binding sites by counting the number of mismatches to it. A mismatch at a 100% A position is much more severe than one at a 75% A position, but the count does not account for this, so the researcher is misled. We present mathematically robust graphical replacements for the consensus sequence, called the sequence logo and the walker, that do not discard your hard-earned data. Further information and examples may be found on the internet at http://www-lmmb.ncifcrf.gov/~toms/.
Characterize what a binding site looks like
❄ Use Sequence Logos instead
Search for new sites
❄ Use Rindividual Matrix Scans instead
Investigate how well the bases of a sequence match to functional binding sites
❄ Use the Walker instead
❄ SEQUENCE LOGO

Raw sequences of binding sites (12 Lambda cI and cro binding sites):

   --------- +++++++++
   9876543210123456789
   ...................
 1 GTATCACCGCCAGTGGTAT
 2 ATACCACTGGCGGTGATAC
 3 TCAACACCGCCAGAGATAA
 4 TTATCTCTGGCGGTGTTGA
 5 TTATCACCGCAGATGGTTA
 6 TAACCATCTGCGGTGATAA
 7 CTATCACCGCAAGGGATAA
 8 TTATCCCTTGCGGTGATAG
 9 CTAACACCGTGCGTGTTGA
10 TCAACACGCACGGTGTTAG
11 TTACCTCTGGCGGTGATAA
12 TTATCACCGCCAGAGGTAA

[Figure: sequence logo of the 12 sites, positions -9 to +9, vertical axis 0 to 2 bits.]
❄ The height of each stack of letters is the information content (sequence conservation) in bits per base.
❄ Letter heights are proportional to frequency: 8/12 = 67% A, 2/12 = 17% G, 1/12 = 8% C, 1/12 = 8% T.
❄ Labels along the logo: Minor Groove Faces Protein, Major Groove Faces Protein.
❄ Zero coordinate for alignment.
❄ Error bar for entire stack of letters.
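The two rules above (stack height is the information content, letter heights are proportional to frequency) can be sketched in a few lines of Python. This is only an illustration: the function name `stack_heights` is made up here, the counts are the 8 A, 2 G, 1 C, 1 T quoted above (they match the last column, +9, of the alignment), and no small-sample correction is applied, so the heights need not agree exactly with the published logo.

```python
import math

def stack_heights(counts):
    """Return (stack height in bits, {base: letter height in bits})."""
    n = sum(counts.values())
    freqs = {base: c / n for base, c in counts.items()}
    # Uncertainty H = -sum p log2 p, skipping bases that never occur.
    h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    # Information = decrease from the 2 bits of four equally likely bases.
    r = 2.0 - h
    # Each letter takes a share of the stack equal to its frequency.
    return r, {base: f * r for base, f in freqs.items()}

counts = {"A": 8, "C": 1, "G": 2, "T": 1}
height, letters = stack_heights(counts)
print(f"stack height: {height:.3f} bits")
for base, h in sorted(letters.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {base}: {h:.3f} bits")
```

A position with 12 A's out of 12 would give the full 2 bits under the same function, which is the difference a consensus letter hides.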
How can two binding sites be different but have the same consensus sequence?
[Figure: two sequence logos on a 5’ to 3’ exon, intron, exon diagram, one labeled donor and one labeled acceptor.]
These two sequence logos have the same consensus sequence (CA GGT) but different emphasis.
❄ SCAN
The individual information weight matrix is placed at every position of a sequence. At each position, the weights for the bases found in the sequence are added together. This gives the total Rindividual (Ri) at every position in the sequence.
[Figure: Ri (bits), -30 to 20, plotted against position (bases), -400 to 100, relative to the E. coli nrd promoter. Marked features: promoter, DnaA sites, Fis sites predicted from consensus.]
Many FIS sites observed in the footprint match the scan.
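A scan of this kind can be sketched as follows. Everything in the example is hypothetical: the four aligned 4-base sites, the sequence being scanned, and the function names are invented for illustration and are not the Fis data. The weights use the uncorrected form Ri(b,l) = 2 + log2 f(b,l), and a base never seen at a position is simply given a weight of minus infinity, which is a simplifying assumption of this sketch.

```python
import math

def weight_matrix(sites):
    """Ri(b,l) = 2 + log2 f(b,l) for each position l of the aligned sites."""
    n = len(sites)
    matrix = []
    for l in range(len(sites[0])):
        column = [s[l] for s in sites]
        weights = {}
        for base in "ACGT":
            count = column.count(base)
            # Assumption: a base never observed at this position gets -inf.
            weights[base] = 2 + math.log2(count / n) if count else float("-inf")
        matrix.append(weights)
    return matrix

def scan(matrix, sequence):
    """Total Ri with the matrix placed at every position of the sequence."""
    width = len(matrix)
    return [
        sum(matrix[l][sequence[start + l]] for l in range(width))
        for start in range(len(sequence) - width + 1)
    ]

sites = ["ACGT", "ACGT", "ACGA", "TCGT"]  # hypothetical aligned sites
sequence = "GGGGACGTGGGG"                 # hypothetical sequence to scan
scores = scan(weight_matrix(sites), sequence)
best = max(range(len(scores)), key=scores.__getitem__)
print(f"best position: {best}, Ri = {scores[best]:.2f} bits")
```

Unlike a mismatch count, the score at each position depends on which base is wrong and where, because each weight reflects how often that base occurs at that position in the known sites.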
Scan of Fis Promoter
6 Fis sites were predicted on the E. coli Fis promoter from footprints and consensus.
[Figure: Ri (bits), -30 to 15, plotted against position (bases), 0 to 1400, in the E. coli fis promoter. Marked feature: promoter.]
Many potential Fis sites besides those already identified on the Fis promoter are observed. The large number near the promoter indicates that Fis controls its own transcription.
How well do the bases of a sequence match to functional binding sites?
❄ WALKER
5 copies of a single sequence are below:
[Figure: five copies of the sequence, each with a walker at a different position:]
   position   Ri (bits)      Z
   178           -5.3      -4.9
   179           -6.7      -5.4
   180            9.2       0.3
   181          -11.7      -7.2
   182           -3.9      -4.4
❄ The Walker is the colored letters.
❄ The weight matrix is for the E. coli Fis protein.
❄ The vertical bar is a scale: the top is at 2 bits, the middle is at 0 bits, and the bottom is at -4 bits.
❄ Letters that go up are preferred at that position.
❄ Letters that go down are detrimental at that position (purple goes below -4 bits).
❄ A green vertical bar indicates a binding site; a red one means it is probably not a site.
❄ The three numbers on the bar are: position, Rindividual, and standard deviations from the mean (Z).
❄ The cosine wave represents the orientation of the DNA facing the protein.
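The numbers a walker displays can be sketched in Python. The example below uses the 12 Lambda cI and cro sites from the sequence logo panel rather than the Fis matrix, so it illustrates the idea rather than reproducing the figure. Several things are assumptions of this sketch: the function names, the uncorrected weights 2 + log2 f(b,l), and the choice to take Z relative to the mean and sample standard deviation of the Ri values of the 12 sites themselves.

```python
import math
import statistics

SITES = [
    "GTATCACCGCCAGTGGTAT", "ATACCACTGGCGGTGATAC", "TCAACACCGCCAGAGATAA",
    "TTATCTCTGGCGGTGTTGA", "TTATCACCGCAGATGGTTA", "TAACCATCTGCGGTGATAA",
    "CTATCACCGCAAGGGATAA", "TTATCCCTTGCGGTGATAG", "CTAACACCGTGCGTGTTGA",
    "TCAACACGCACGGTGTTAG", "TTACCTCTGGCGGTGATAA", "TTATCACCGCCAGAGGTAA",
]

def weight_matrix(sites):
    """Uncorrected Ri(b,l) = 2 + log2 f(b,l); unseen bases get -inf (assumption)."""
    n = len(sites)
    matrix = []
    for l in range(len(sites[0])):
        column = [s[l] for s in sites]
        matrix.append({
            base: 2 + math.log2(column.count(base) / n)
            if column.count(base) else float("-inf")
            for base in "ACGT"
        })
    return matrix

def walker(matrix, sequence):
    """Per-base weights (the walker's letter heights) and their total Ri."""
    heights = [matrix[l][base] for l, base in enumerate(sequence)]
    return heights, sum(heights)

matrix = weight_matrix(SITES)
site_ri = [walker(matrix, s)[1] for s in SITES]
mean, sd = statistics.mean(site_ri), statistics.stdev(site_ri)

sequence = SITES[10]  # sequence 11 of the alignment
heights, ri = walker(matrix, sequence)
z = (ri - mean) / sd
for position, (base, h) in enumerate(zip(sequence, heights), start=-9):
    direction = "up" if h > 0 else "down" if h < 0 else "flat"
    print(f"{position:+3d} {base} {h:6.2f} {direction}")
print(f"Ri = {ri:.1f} bits, Z = {z:.1f}")
```

Each printed line corresponds to one letter of the walker: positive weights are letters that go up, negative weights are letters that go down.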
Is a sequence change a Mutation or a Polymorphism?
A T to C change seen in a splice acceptor of hMSH2 was interpreted to be the mutation which causes familial nonpolyposis colon cancer (Fishel et al., Cell 75:1027-1038, 1993).
[Figure: sequence logo, positions -25 to +2, vertical axis 0 to 2 bits.]
The sequence logo shows nearly equal frequencies of bases there.
[Figure: walkers at position 0 on three versions of the sequence:]
   sequence                                    Ri (bits)      Z
   wild-type as seen by a walker                   6.5      -0.5
   with the T to C change                          6.3      -0.6
   what a strong mutation would look like         -0.9      -2.1
The walker shows it is a polymorphism.
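The comparison the walkers make can be written as a small calculation on the Ri values shown in the figure (6.5 bits for the wild type, 6.3 bits with the T to C change, -0.9 bits for the strong-mutation example). The function name `compare` is hypothetical, and the test against zero borrows the poster's statement that zero divides sites from non-sites; treat it as a rough sketch rather than a diagnostic rule.

```python
def compare(ri_wild_type, ri_variant):
    """Change in individual information, and whether the variant stays above zero."""
    delta = round(ri_variant - ri_wild_type, 1)
    # Assumption taken from the poster: zero divides sites from non-sites.
    still_a_site = ri_variant > 0
    return delta, still_a_site

print(compare(6.5, 6.3))   # the T to C change
print(compare(6.5, -0.9))  # the strong-mutation example
```

The T to C change costs 0.2 bits and leaves the site well above zero, while the strong-mutation example loses 7.4 bits and drops below zero.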
[Figure: atom-labeled structures of the A-T and G-C base pairs, each with its Minor Groove and Major Groove sides marked, and d and a labels along the groove edges.]
INFORMATION THEORY
A BIT measures the choice between 2 equally likely possibilities: one bit is like a knife slice.
To choose one base in 4 requires two bits: two bits are like two knife slices.
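The knife-slice picture is the base-2 logarithm of the number of equally likely possibilities. A minimal check in Python (the function name is invented for illustration):

```python
import math

def bits_to_choose(n_possibilities):
    """Bits needed to pick one of n equally likely possibilities."""
    return math.log2(n_possibilities)

print(bits_to_choose(2))  # one knife slice
print(bits_to_choose(4))  # one base in four: two knife slices
```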
Information measured in bits
❄ is additive
❄ is well supported mathematically
❄ measures sequence conservation
❄ is related to the entropy
❄ is calculated as a decrease in the uncertainty H:
     H = -Σ p_i log2 p_i   (bits)
  as
     R = -ΔH   (bits/symbol)
  R is the Rate of information transmission.

Every sequence has an individual information

   --------- +++++++++
   9876543210123456789
   ...................
 1 gtatcaccgccagtggtat 17.7 bits
 2 ataccactggcggtgatac 17.7 bits
 3 tcaacaccgccagagataa 19.3 bits
 4 ttatctctggcggtgttga 19.3 bits
 5 ttatcaccgcagatggtta 15.7 bits
 6 taaccatctgcggtgataa 15.7 bits
 7 ctatcaccgcaagggataa 17.3 bits
 8 ttatcccttgcggtgatag 17.3 bits
 9 ctaacaccgtgcgtgttga 11.0 bits
10 tcaacacgcacggtgttag 11.0 bits
11 ttacctctggcggtgataa 21.5 bits
12 ttatcaccgccagaggtaa 21.5 bits

   Σ = 205 bits
   Σ/12 = 17.1 bits

The average of the individual information values for the original sequences is the same as the sequence conservation and the area under the sequence logo.
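The equality between the average individual information and the area under the logo can be checked numerically on the 12 sequences. The sketch below assumes the uncorrected weight 2 + log2 f(b,l) and uses hypothetical function names. The values listed on the poster may include corrections this sketch omits, so the absolute numbers need not agree; what it demonstrates is that the two quantities computed from the same frequencies coincide.

```python
import math

SITES = [
    "gtatcaccgccagtggtat", "ataccactggcggtgatac", "tcaacaccgccagagataa",
    "ttatctctggcggtgttga", "ttatcaccgcagatggtta", "taaccatctgcggtgataa",
    "ctatcaccgcaagggataa", "ttatcccttgcggtgatag", "ctaacaccgtgcgtgttga",
    "tcaacacgcacggtgttag", "ttacctctggcggtgataa", "ttatcaccgccagaggtaa",
]

def column_frequencies(sites):
    """f(b,l): frequency of each base b at each position l."""
    n = len(sites)
    return [
        {base: [s[l] for s in sites].count(base) / n for base in "acgt"}
        for l in range(len(sites[0]))
    ]

FREQS = column_frequencies(SITES)

def individual_information(sequence):
    """Sum of Ri(b,l) = 2 + log2 f(b,l) over the bases of one sequence."""
    return sum(2 + math.log2(FREQS[l][base]) for l, base in enumerate(sequence))

def sequence_conservation():
    """Area under the logo: sum over positions of 2 - H(l)."""
    total = 0.0
    for column in FREQS:
        h = -sum(f * math.log2(f) for f in column.values() if f > 0)
        total += 2 - h
    return total

ri = [individual_information(s) for s in SITES]
mean_ri = sum(ri) / len(ri)
rsequence = sequence_conservation()
for number, value in enumerate(ri, start=1):
    print(f"{number:2d} {value:5.1f} bits")
print(f"mean Ri = {mean_ri:.3f} bits, area under logo = {rsequence:.3f} bits")
```

As on the poster, the sequences come out in pairs with equal values, because each odd-numbered sequence is the reverse complement of the one that follows it.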
Individual Information
[Figure: sequence logo of 92 FIS binding sites, positions -10 to +10, vertical axis 0 to 2 bits.]
The area under a sequence logo is the total sequence conservation. The mathematics makes it look like an average.
QUESTION: What is it the average of?
ANSWER: The information of each individual binding site.
This is found from the individual information weight matrix as:
     Ri(b,l) = 2 + log2 f(b,l)
where f(b,l) is the frequency of each base b at every position l in the binding site.

       number of bases b at each position l     Ri(b,l) weight matrix
   l   n(a;l)  n(c;l)  n(g;l)  n(t;l)      Ri(a;l)    Ri(c;l)    Ri(g;l)    Ri(t;l)
 -10     36       7      21      28       0.622843  -1.739727  -0.154765   0.260273
  -9     20      19      15      38      -0.225154  -0.299154  -0.640191   0.700846
  -8     18      24      11      39      -0.377157   0.037881  -1.087650   0.738320
  -7      3       0      85       4      -2.962119  -6.539159   1.862309  -2.547082
  -6     24      19      29      20       0.037881  -0.299154   0.310899  -0.225154
  -5     20      18      15      39      -0.225154  -0.377157  -0.640191   0.738320
  -4      5      40      12      35      -2.225154   0.774846  -0.962119   0.582201
  -3     73       2      15       2       1.642743  -3.547082  -0.640191  -3.547082
  -2     53       9      10      20       1.180838  -1.377157  -1.225154  -0.225154
  -1     40       8      11      33       0.774846  -1.547082  -1.087650   0.497312
   0     42       4       4      42       0.845235  -2.547082  -2.547082   0.845235
   1     33      11       8      40       0.497312  -1.087650  -1.547082   0.774846
   2     20      10       9      53      -0.225154  -1.225154  -1.377157   1.180838
   3      2      15       2      73      -3.547082  -0.640191  -3.547082   1.642743
   4     35      12      40       5       0.582201  -0.962119   0.774846  -2.225154
   5     39      15      18      20       0.738320  -0.640191  -0.377157  -0.225154
   6     20      29      19      24      -0.225154   0.310899  -0.299154   0.037881
   7      4      85       0       3      -2.547082   1.862309  -6.539159  -2.962119
   8     39      11      24      18       0.738320  -1.087650   0.037881  -0.377157
   9     38      15      19      20       0.700846  -0.640191  -0.299154  -0.225154
  10     28      21       7      36       0.260273  -0.154765  -1.739727   0.622843
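The formula can be tried directly on the counts in the table. The sketch below takes the rows for positions -10 and -7 (92 sites in total) and compares the uncorrected 2 + log2 f(b,l) with the listed weights. The function name is made up for this example. For these rows the listed weights come out a constant amount lower than the bare formula; that would be consistent with a small-sample correction having been applied, but that reading is an assumption here, as is the treatment of the zero count, for which the formula has no finite value.

```python
import math

N_SITES = 92

# (count, listed Ri) pairs copied from the rows for positions -10 and -7.
ROWS = {
    -10: {"a": (36, 0.622843), "c": (7, -1.739727),
          "g": (21, -0.154765), "t": (28, 0.260273)},
    -7: {"a": (3, -2.962119), "c": (0, -6.539159),
         "g": (85, 1.862309), "t": (4, -2.547082)},
}

def ri_weight(count, n=N_SITES):
    """Uncorrected Ri(b,l) = 2 + log2 f(b,l); None for a base never observed."""
    return 2 + math.log2(count / n) if count else None

differences = []
for position, row in ROWS.items():
    for base, (count, listed) in row.items():
        computed = ri_weight(count)
        if computed is None:
            print(f"{position:4d} {base}: no finite value from the formula, "
                  f"table lists {listed}")
            continue
        differences.append(computed - listed)
        print(f"{position:4d} {base}: computed {computed:9.6f}, table {listed:9.6f}")
print(f"computed minus table: {min(differences):.4f} to {max(differences):.4f} bits")
```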
Individual Information Distributions
[Figure: distribution of individual information, number of sites plotted against Ri. Labels: Rsequence = mean, SEM, 0, sites, non-sites, consensus, anti-consensus.]
❄ The consensus is an extremely abnormal, strong binding site with probability typically < 10^-5.
❄ Zero divides sites from non-sites.
❄ The average of the individual information values ... is the area under the sequence logo.
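One way to read the claim about the consensus is to ask how likely it is that a site carries the most frequent base at every position. The sketch below does this for the 92 Fis sites, using the counts from the weight matrix table and assuming that positions are independent. Both that assumption and the function name are choices made for this example, not something the poster spells out.

```python
# Counts (a, c, g, t) for positions -10 to +10, copied from the Fis table.
COUNTS = [
    (36, 7, 21, 28), (20, 19, 15, 38), (18, 24, 11, 39), (3, 0, 85, 4),
    (24, 19, 29, 20), (20, 18, 15, 39), (5, 40, 12, 35), (73, 2, 15, 2),
    (53, 9, 10, 20), (40, 8, 11, 33), (42, 4, 4, 42), (33, 11, 8, 40),
    (20, 10, 9, 53), (2, 15, 2, 73), (35, 12, 40, 5), (39, 15, 18, 20),
    (20, 29, 19, 24), (4, 85, 0, 3), (39, 11, 24, 18), (38, 15, 19, 20),
    (28, 21, 7, 36),
]

def consensus_probability(counts):
    """Chance of the most frequent base at every position, positions independent."""
    p = 1.0
    for row in counts:
        p *= max(row) / sum(row)
    return p

p = consensus_probability(COUNTS)
print(f"P(consensus) = {p:.1e}")
```

Under these assumptions the result falls below the 10^-5 figure quoted above, even though every base in the consensus is the most common choice at its own position.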