entropy & information content by thomas nordahl petersen
Entropy, Information Content & Logo Plots
By Thomas Nordahl Petersen
GTTCTTCGTGTTTATTTTTAGGAAATTGATGATTGTTTCTCCTTTTAAAATAGTACTGCTGTTTTTTACTAACGACACATTGAAGAAATCACTTTGGATACGCTTACCGTTATCCAGAGCTACAGCGCTACTAATATGTAATACTTCAGCTCCCCTTAATATTGAGATCTTTTTTAACTAGTTAGGTCTACCTTCTCCCCTTCTTCATTTTAGCCTGTTTGGACTAACATAACTTATTTACATAGTGCCATTGAACGATATTTCCCGTTGTGTTAAGGCTGAGAAGAATTTTCCCGACCATCAAGACAGGTGATTTATCATGCAAAAACTTTTTTTCACAGGGCTAACTTGCGTTTATTGTGTTTCCACTCAGTTAAAAAACGAAACGTACTTTAATATTTATAGTACTTCATTCGAACATGCTATTTTTCATACAGCAACCTCACATCTGCACTCATCATTAGATTAGAGGAACATGGATACTTTTCTTTATCTAAGCAGCTAACTCAACTATCAACATGCTATTGAACTAGAGATCCACCTATAACTAACATGACTTTAACAGGGCTAATTTACAGTACTAACTAATTAACTTAGAACATTAACATGATCACCGTCACATTTATTAGAATTTCAAACGCAGTGGAATTTTTTTTTCTAGAAATGGTATCGCTCTATGACCAATAAAAACAGACTGTACTTTCAAATGGTATTATTTATAACAGTTGAACATTTCATAAATATGCGATCAATATAGACCGTTGATATATTTTACTTTTTTTTTTTTAGGAGCTCCAAGAATTTATTTCCTTATAATACAGACACGGTTACATCGCAATTAATTTTCTAATAGTTTTTCATTTTGACCATCTTTCTTTTCCCCAGTGCTAAACACGAACCTTCTTTCTCATTCGTAGATTACTGTTGCAATTACTAACAGCTGTAATAGCCGACAAATTTCTCTCTGCGCGTCCAATTTAGCTATACTGTTGTTGTTTTGTTTTGTCGTACAGTGTTTGGAGAAAAACTTCCATTTCTTACATAGATCATCGCCATTCCTTTCCATAATTTATTCAGCGCTTTGGTATCGATTTACTATTTCCATTTAGACGTTGTTCAAAATTTACTAACAATACTTCAGTTTATAATGGATCCTATACTAACAATTTGTAGTTCATAAATAA
• Multiple alignment of acceptor sites from 268 yeast DNA sequences
– What is the biological signal around the site?
– What are the important positions?
– How can it be visualized?
Biological information
Sequence-logo
• Logo plot with Information Content
Exon Intron Exon
Entropy - Definition
• The entropy of a random variable is a measure of its uncertainty
• In thermodynamics: G = H - TS
– The entropy S of a system is its degree of disorder
Entropy - Definition
• Entropy of a distribution of amino acids
– The Shannon entropy:
$H(p) = -\sum_a p_a \log_2 p_a$, where p is an amino acid distribution.
H(p) is measured in bits: log2(2) = 1, log2(4)=2
Multiple alignment of 3 sequences
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder
Low entropy - low disorder
Entropy - example
$H(p) = -\sum_a p_a \log_2 p_a$
Multiple alignment of 3 sequences
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: H(p) = -[1*log2(1)] = 0
Pos2: H(p) = -[1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] = log2(3) ≈ 1.58
Pos3: H(p) = -[2/3*log2(2/3) + 1/3*log2(1/3)] ≈ 0.92
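The per-position entropies of the three-sequence example can be checked with a few lines of Python. This is a minimal sketch; the function name `shannon_entropy` is chosen here for illustration and is not from the slides.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    total = len(column)
    # p * log2(1/p) is the same as -p * log2(p); symbols that never occur
    # are not in the Counter and therefore contribute nothing.
    return sum((n / total) * math.log2(total / n) for n in Counter(column).values())

alignment = ["ALR", "AVR", "AIK"]  # Seq1..Seq3 from the example
columns = ["".join(col) for col in zip(*alignment)]
for pos, col in enumerate(columns, start=1):
    print(f"Pos{pos}: H = {shannon_entropy(col):.2f} bits")
```

For the example this prints 0.00, 1.58 and 0.92 bits for positions 1 to 3.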
Relative Entropy
The Kullback-Leibler distance D
How different is an amino acid distribution p_a from a background distribution q_a, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt (values in percent):
Ala (A) 7.82   Gln (Q) 3.94   Leu (L) 9.62   Ser (S) 6.87
Arg (R) 5.32   Glu (E) 6.60   Lys (K) 5.93   Thr (T) 5.46
Asn (N) 4.20   Gly (G) 6.94   Met (M) 2.37   Trp (W) 1.16
Asp (D) 5.30   His (H) 2.27   Phe (F) 4.01   Tyr (Y) 3.07
Cys (C) 1.56   Ile (I) 5.90   Pro (P) 4.85   Val (V) 6.71
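As an illustration, the Kullback-Leibler distance against this background can be computed as sketched below. The sketch assumes the table values are percentages (they sum to about 99.9) and renormalises them; the function name `kullback_leibler` and the example distribution are chosen here for illustration.

```python
import math

# Background frequencies copied from the table above (assumed to be percent).
BACKGROUND_PERCENT = {
    "A": 7.82, "Q": 3.94, "L": 9.62, "S": 6.87,
    "R": 5.32, "E": 6.60, "K": 5.93, "T": 5.46,
    "N": 4.20, "G": 6.94, "M": 2.37, "W": 1.16,
    "D": 5.30, "H": 2.27, "F": 4.01, "Y": 3.07,
    "C": 1.56, "I": 5.90, "P": 4.85, "V": 6.71,
}
# Renormalise so the background sums to exactly 1 (rounding leaves ~99.9 %).
_total = sum(BACKGROUND_PERCENT.values())
BACKGROUND = {aa: v / _total for aa, v in BACKGROUND_PERCENT.items()}

def kullback_leibler(p, q):
    """D(p||q) in bits; terms with p_a = 0 contribute 0 by convention."""
    return sum(pa * math.log2(pa / q[aa]) for aa, pa in p.items() if pa > 0)

# Observed distribution in a column containing R, R, K.
p = {"R": 2 / 3, "K": 1 / 3}
print(f"D(p||q) = {kullback_leibler(p, BACKGROUND):.2f} bits")
```

The distance is zero only when the observed distribution equals the background, and it grows as the column becomes more conserved or favours rare amino acids.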
Information content
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Often the information content is used as a measure of the degree of conservation.
$I = \sum_a p_a \log_2(p_a/q_a)$
A special case is when all amino acids have the same background frequency: $q_a = 1/20$
Information content
• $I = \sum_a p_a \log_2(p_a/(1/20))$
• $= \sum_a p_a [\log_2 p_a - \log_2(1/20)]$
• $= -H(p) - \sum_a p_a \log_2(1/20)$
• $= -H(p) + \sum_a p_a \log_2(20)$
• $= -H(p) + \log_2(20)$, since $\sum_a p_a = 1$
• $= -H(p) + 4.32$
Information content
• $I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
The information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: I = [1*log2(1)] + 4.32 = 4.32
Pos2: I = [1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] + 4.32 ≈ -1.58 + 4.32 = 2.74
Pos3: I = [2/3*log2(2/3) + 1/3*log2(1/3)] + 4.32 ≈ -0.92 + 4.32 = 3.40
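The same three positions can be evaluated in Python as a check on the arithmetic. This is a minimal sketch assuming a uniform background (q_a = 1/20); the function name `information_content` is illustrative.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """I = log2(alphabet_size) - H(p) in bits, assuming a uniform background."""
    total = len(column)
    entropy = sum((n / total) * math.log2(total / n) for n in Counter(column).values())
    return math.log2(alphabet_size) - entropy

# Columns of the alignment Seq1: A L R, Seq2: A V R, Seq3: A I K
for pos, col in enumerate(["AAA", "LVI", "RRK"], start=1):
    print(f"Pos{pos}: I = {information_content(col):.2f} bits")
```

This prints 4.32, 2.74 and 3.40 bits: the fully conserved first position carries the maximum, and the most variable second position the least.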
Count nucleotides at each position:
A 94 88 84 75 78 78 71 69 70 60 68 77 32 49 87 93 93 134 9 266 0 86 66 85 81 89 81 88 82
C 31 45 52 44 56 46 62 54 56 51 46 37 30 42 32 44 30 25 122 1 0 38 65 52 43 62 62 57 43
T 113 110 113 117 104 117 111 120 118 125 136 140 182 155 122 100 124 75 137 0 0 72 85 82 91 83 73 67 96
G 30 25 19 32 30 27 24 25 24 32 18 14 24 22 27 31 21 34 0 1 268 72 52 49 53 34 52 56 47
Convert to frequencies:
A 0.35 0.33 0.31 0.28 0.29 0.29 0.26 0.26 0.26 0.22 0.25 0.29 0.12 0.18 0.32 0.35 0.35 0.50 0.03 0.99 0.00 0.32 0.25 0.32 0.30 0.33 0.30 0.33 0.31
C 0.12 0.17 0.19 0.16 0.21 0.17 0.23 0.20 0.21 0.19 0.17 0.14 0.11 0.16 0.12 0.16 0.11 0.09 0.46 0.00 0.00 0.14 0.24 0.19 0.16 0.23 0.23 0.21 0.16
T 0.42 0.41 0.42 0.44 0.39 0.44 0.41 0.45 0.44 0.47 0.51 0.52 0.68 0.58 0.46 0.37 0.46 0.28 0.51 0.00 0.00 0.27 0.32 0.31 0.34 0.31 0.27 0.25 0.36
G 0.11 0.09 0.07 0.12 0.11 0.10 0.09 0.09 0.09 0.12 0.07 0.05 0.09 0.08 0.10 0.12 0.08 0.13 0.00 0.00 1.00 0.27 0.19 0.18 0.20 0.13 0.19 0.21 0.18
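The two steps, counting nucleotides per position and dividing by the column total, could be sketched in Python as follows. The function names and the toy alignment are hypothetical; only the first column of counts (94, 31, 113, 30) is taken from the table.

```python
from collections import Counter

def count_matrix(sequences, alphabet="ACTG"):
    """Count each symbol at each position of equal-length aligned sequences."""
    matrix = []
    for column in zip(*sequences):
        seen = Counter(column)
        matrix.append({sym: seen[sym] for sym in alphabet})
    return matrix

def to_frequencies(matrix):
    """Divide every count by the total of its column."""
    freqs = []
    for column in matrix:
        total = sum(column.values())
        freqs.append({sym: n / total for sym, n in column.items()})
    return freqs

# Hypothetical toy alignment, only to exercise count_matrix.
toy = ["TAG", "CAG", "TAG", "TTG"]
print(count_matrix(toy)[0])

# First column of the count table (94 + 31 + 113 + 30 = 268 sequences).
first_column = {"A": 94, "C": 31, "T": 113, "G": 30}
for sym, f in to_frequencies([first_column])[0].items():
    print(sym, f"{f:.2f}")
```

The sketch assumes all sequences have the same length and contain only the four listed nucleotides; gaps or ambiguity codes would need their own handling.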
Frequency-logo:
Logo plots - HowTo
Logo plots - Information Content
Sequence-logo
Calculate Information Content
$I = \sum_a p_a \log_2 p_a + \log_2(4)$; the maximal value is 2 bits
• Total height at a position is the 'Information Content' measured in bits.
• Height of a letter is proportional to the frequency of that letter.
• A Logo plot is a visualization of a multiple alignment.
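The letter heights of a DNA logo follow directly from these rules: the stack height is the information content, and each letter takes a share equal to its frequency. The sketch below uses the counts of two positions next to the acceptor site from the count table; the function name `logo_heights` is illustrative, and the small-sample corrections that logo programs may apply are left out.

```python
import math

def logo_heights(freqs):
    """Letter heights in bits for one position: frequency x information content.

    `freqs` must list every symbol of the alphabet, including those with
    frequency 0, because the alphabet size sets the maximum (log2(4) = 2 bits).
    """
    entropy = sum(p * math.log2(1 / p) for p in freqs.values() if p > 0)
    info = math.log2(len(freqs)) - entropy
    return {sym: p * info for sym, p in freqs.items()}

# Counts copied from the table: the almost invariant A and the invariant G.
for counts in ({"A": 266, "C": 1, "T": 0, "G": 1}, {"A": 0, "C": 0, "T": 0, "G": 268}):
    total = sum(counts.values())
    heights = logo_heights({sym: n / total for sym, n in counts.items()})
    print({sym: round(h, 2) for sym, h in heights.items()})
```

A completely conserved position gives a single letter of height 2 bits, while a position with four equally frequent nucleotides gives a stack of height 0.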
~0.5 each
Completely conserved
Programs to make a Logo plot
• WebLogo
– Requires a multiple alignment as input
– Protein or DNA sequences
– More output formats
• Blast2Logo
– Requires a FASTA file as input
– Only protein sequences
– Runs PSI-BLAST and makes a table of frequencies
– PDF logo plot
WebLogo - http://weblogo.berkeley.edu/
Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGRSARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGVNETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAGVEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVLTTTSFEGTCL
What is the next step?
1. Find homologous sequences - how?
- Blast or PsiBlast
- Download sequences
- Make a multiple alignment
- ClustalW or others
- or use Blast2Logo program
Multiple alignment programs
Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/
Important positions
Important positions in proteins are conserved positions => high Information Content.
Conserved for a reason:
• Functionally important positions
– Catalytic residues
• Structurally important positions
– Maintain the correct fold of the protein
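One simple way to look for such positions is to list the alignment columns whose information content exceeds a cutoff. The sketch below assumes a uniform background and an ungapped protein alignment; the function name `conserved_positions` and the 3.0-bit threshold are hypothetical choices, not values from the slides.

```python
import math
from collections import Counter

def conserved_positions(alignment, min_bits=3.0, alphabet_size=20):
    """Return (position, information content) for columns with I >= min_bits."""
    hits = []
    for pos, column in enumerate(zip(*alignment), start=1):
        total = len(column)
        entropy = sum((n / total) * math.log2(total / n) for n in Counter(column).values())
        info = math.log2(alphabet_size) - entropy
        if info >= min_bits:
            hits.append((pos, info))
    return hits

for pos, info in conserved_positions(["ALR", "AVR", "AIK"]):
    print(f"position {pos}: {info:.2f} bits")
```

With only three sequences the numbers are not meaningful; a real analysis would use many homologous sequences and treat gap characters explicitly.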
Blast2logo
Runs iterative BLAST, i.e. PSI-BLAST
Searching for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
1. Iteration - use Blosum62 scoring matrix
2. Iteration - make profile of sequences found in iteration 1
3. Iteration - make profile of sequences found in iteration 2
4. Iteration - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight sequences such that very similar sequences are down-weighted
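The idea behind the last step, down-weighting near-identical sequences and correcting for low counts, can be illustrated with a generic scheme. The sketch below uses position-based sequence weights in the style of Henikoff and Henikoff plus a flat pseudocount; this is a swapped-in textbook technique for illustration, and the slides do not say that Blast2logo or PSI-BLAST use exactly this. All names and the pseudocount value are assumptions.

```python
from collections import Counter

def henikoff_weights(sequences):
    """Position-based sequence weights, normalised to sum to 1."""
    weights = [0.0] * len(sequences)
    for column in zip(*sequences):
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for i, sym in enumerate(column):
            # Symbols shared by many sequences earn each of them less weight.
            weights[i] += 1.0 / (k * counts[sym])
    total = sum(weights)
    return [w / total for w in weights]

def weighted_column_frequencies(column, weights, alphabet, pseudocount=0.01):
    """Weighted frequencies for one column with a flat pseudocount (assumed value)."""
    freqs = {sym: pseudocount for sym in alphabet}
    for sym, w in zip(column, weights):
        freqs[sym] += w
    total = sum(freqs.values())
    return {sym: v / total for sym, v in freqs.items()}

seqs = ["ALR", "ALR", "AIK"]  # hypothetical: two identical sequences and one distinct
w = henikoff_weights(seqs)
print([round(x, 3) for x in w])
col2 = [s[1] for s in seqs]
print(weighted_column_frequencies(col2, w, alphabet="ILV"))
```

The two identical sequences receive a smaller weight each than the distinct one, so a family that is oversampled in the search results does not dominate the frequency table.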
Important positions - counting
Example. Where is the active site?
• Sequence profiles might show you where to look!
• The active site could be around
– S9, G42, N74, and H195
Exercise
1. Calculate nucleotide frequencies from a multiple alignment of human donor sites
2. Calculate Entropy and Information content
3. Draw (by hand) a Logo plot
4. Use 2 Logo plot programs
5. Learn to interpret Logo & frequency plots
6. Active site residues & structural residues