Sequence Analysis, '18 -- lecture 9
Families and superfamilies.
Sequence weights.
Profiles. Logos.
Building a representative model for a gene.
How can I represent thousands of homologous sequences in a compact way?
Families and superfamilies
To model a whole family or superfamily, we need to sample sequence space. A family is the set of sequences of true homologs that can “see” each other in a database search. A superfamily is a group of distant homologs that cannot be easily found by sequence searches.
The weird shape of sequence space
At short sequence distances, all sequences are homologs. At longer distances, they are not. Homologs are clustered and may be separated by regions of non-homologous sequence space. Very, very distant sequences may still be homologous.
Family: short evolutionary distance. Visible by pairwise comparison.
Superfamily: long evolutionary distance. Visible by profile-based methods.
Redundancy problem
If we submit one sequence (for example, citrate synthase from human) to the GenBank database (using BLAST, for example), take 100 results, and build a cladogram from them, we might get something like this...
[Figure: cladogram with leaves grouped as primates, rabbit, rat, E. coli, and lawyer.]
What is our representative going to look like if we use the rule "one sequence, one vote"?
Sequence weighting corrects for uneven sampling
To build a representative model we can...
(1) throw out all redundant sequences and keep representatives of each clade only, or
(2) apply a weight to each sequence reflecting how non-redundant that sequence is.
One measure of non-redundancy is sequence-distance, or evolutionary distance.
Crude weights from a cladogram
Simplest weighting scheme: Start with weight = 1.0 at the common ancestor of the tree. Split the weight evenly at each node.
[Figure: cladogram with leaves grouped as primates, rabbit, rat, E. coli, and lawyer. Each branch is labeled with its share of the weight, halving at every split: 1.000, 0.500, 0.250, 0.125, 0.0625, 0.031, 0.016, 0.008.]
Human sequences are 10/18 of the tree, but only 0.125 of the weight.
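The even-split rule can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the nested-tuple tree encoding, the function name `crude_weights`, and the toy tree (which is much smaller than the one on the slide) are all assumptions made for the example.

```python
def crude_weights(tree, weight=1.0):
    """Split `weight` evenly among the children of every node.

    `tree` is either a leaf name (str) or a tuple of subtrees.
    Returns a dict mapping each leaf name to its weight.
    """
    if isinstance(tree, str):           # a leaf keeps whatever weight reached it
        return {tree: weight}
    weights = {}
    share = weight / len(tree)          # even split at this node
    for child in tree:
        weights.update(crude_weights(child, share))
    return weights


# Hypothetical toy cladogram (not the tree from the slide).
toy_tree = ("E. coli", ("rat", ("rabbit", ("human_1", "human_2"))))
w = crude_weights(toy_tree)
for name, value in w.items():
    print(f"{name:8s} {value:.4f}")
```

In this toy tree the two human sequences sit deepest in the tree, so each ends up with 0.0625 while E. coli alone carries 0.5: heavily sampled clades are down-weighted automatically.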
What part of the MSA measures time the best?
[Figure labels: slow-evolving region; hot spot, fast-evolving region.]
Pruning and trimming to remove hotspots
Red boxes show blocks that are representative of the family -- useful for remote homolog detection, useful for phylogenetics if changes are sufficient.
Yellow boxes show hot spots of mutation -- useful for near-homolog phylogenetic analysis, not for remote homolog detection.
In-class exercise: Make a representative profile
• Select a sequence.
• Find all near homologs using NCBI BLAST. Keep 200.
• Make a representative MSA using MUSCLE as follows (in shorthand): Align... Trim... Align... Prune... Align... Prune... (Calculate distances... Prune... Align...), with the last group repeated n times.
• Calculate a profile (you must trim all gaps first).
Another Exercise: Re-aligning within an MSA
» Select region. MUSCLE | Align column range.
» All gaps in an internal segment are internal gaps, not end gaps.
» Useful for resolving the speciation order, sometimes.
Better weights from a phylogram
The sequence weight is calculated starting from the distance from the taxon to the first ancestor node, adding half of the distance from the first ancestor to the second ancestor, 1/4th of the distance from the second to third ancestor, and so on.
Finally, the weights are normalized to sum to 1.00.
[Figure: phylogram with leaves A, B, C. Branch lengths: A = 0.2, B = 0.1, branch from the A/B ancestor to the root = 0.3, C = 0.5.]
wA = 0.2 + 0.3/2 = 0.35
wB = 0.1 + 0.3/2 = 0.25
wC = 0.5
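The halving rule above can be written as a short recursive function. The sketch below is one possible reading, with assumptions labeled: the `(branch_length, content)` tuple encoding and the function name `phylogram_weights` are invented for illustration, and the branch lengths are taken from the wA, wB, wC arithmetic on the slide.

```python
def phylogram_weights(node, ancestor_lengths=()):
    """Return unnormalized weights {leaf: weight} for a phylogram.

    node = (branch_length, name) for a leaf,
           (branch_length, [children]) for an internal node.
    ancestor_lengths holds the branch lengths above this node, nearest first.
    """
    length, content = node
    lengths = (length,) + ancestor_lengths
    if isinstance(content, str):
        # own branch + 1/2 of the next branch up + 1/4 of the one after that ...
        return {content: sum(d / 2**k for k, d in enumerate(lengths))}
    weights = {}
    for child in content:
        weights.update(phylogram_weights(child, lengths))
    return weights


# Tree from the slide: A (0.2) and B (0.1) join at a node 0.3 below the root; C is 0.5.
root = (0.0, [(0.3, [(0.2, "A"), (0.1, "B")]), (0.5, "C")])
raw = phylogram_weights(root)
total = sum(raw.values())
normalized = {name: value / total for name, value in raw.items()}
print(raw)
print(normalized)
```

The raw weights reproduce 0.35, 0.25 and 0.5; they sum to 1.10, so the normalization step divides each by 1.10.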
Distance-based weights
Self-consistent weights method of Sander & Schneider, 1994
[Figure: distance matrix for sequences A, B, C: D(A,B) = 0.3, D(A,C) = 1.0, D(B,C) = 0.9.]
(1) Sum the weighted distances to get new weights.
(2) Normalize the new weights.
(3) Repeat (1) and (2) until no change.
Pseudocode:
all wi initialized to 1
while (the wi are still changing) do
  for i from A to C do
    w'i = Σj wj Dij
  end do
  for i from A to C do
    wi = w'i / Σj w'j
  end do
end do
Distance-based weights
Running the pseudocode:
[Figure: distance matrix for sequences A, B, C: D(A,B) = 0.3, D(A,C) = 1.0, D(B,C) = 0.9.]
(1) Sum the weighted distances to get new weights.
w'A = 0.3 + 1.0 = 1.3
w'B = 0.3 + 0.9 = 1.2
w'C = 1.0 + 0.9 = 1.9
(2) Normalize the new weights.
wA = 1.3/(1.3+1.2+1.9) = 0.30
wB = 1.2/4.4 = 0.27
wC = 1.9/4.4 = 0.43
(1) Sum the weighted distances to get new weights.
w'A = 0.3*0.27 + 1.0*0.43 = 0.51
w'B = 0.3*0.30 + 0.9*0.43 = 0.48
w'C = 1.0*0.30 + 0.9*0.27 = 0.54
(3) Repeat (1) and (2) until no change.
wABC = 0.33 0.31 0.35
wABC = 0.30 0.28 0.42
wABC = 0.31 0.29 0.40
wABC = 0.30 0.28 0.41
wABC = 0.30 0.28 0.41
converged.
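The three steps can be turned into a runnable loop. The following is a sketch under stated assumptions, not a reference implementation of the published method: the function name, the dict-of-dicts distance matrix, the tolerance `tol`, and the iteration cap `max_iter` are all choices made here for illustration.

```python
def self_consistent_weights(dist, tol=1e-9, max_iter=1000):
    """Iterate w'_i = sum_j w_j * D_ij, then normalize, until the weights stop changing.

    dist[i][j] is the distance between sequences i and j (i != j).
    """
    names = sorted(dist)
    w = {name: 1.0 for name in names}            # all weights start at 1
    for _ in range(max_iter):
        # (1) sum the weighted distances
        new = {i: sum(w[j] * dist[i][j] for j in names if j != i) for i in names}
        # (2) normalize so the weights sum to 1
        total = sum(new.values())
        new = {i: value / total for i, value in new.items()}
        # (3) stop when nothing changes any more
        if max(abs(new[i] - w[i]) for i in names) < tol:
            return new
        w = new
    return w


# Distances from the slide.
D = {
    "A": {"B": 0.3, "C": 1.0},
    "B": {"A": 0.3, "C": 0.9},
    "C": {"A": 1.0, "B": 0.9},
}
weights = self_consistent_weights(D)
print({name: round(value, 2) for name, value in weights.items()})
```

C, the most distant sequence, receives the largest weight and B the smallest. Because this loop carries full floating-point precision between iterations, its final values may differ slightly in the second decimal place from a hand calculation that rounds at each step.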
Amino acid probability profiles
An amino acid profile is defined as a set of probability distributions over the 20 amino acids, a probability density function (PDF) for each position in the alignment. Gap probabilities are usually not included.
$P(a|i) = \sum_{S : S_i = a} w_S \,/\, \sum_{S} w_S$
The probability of amino acid a at position i is the sum of the sequence weights wS over all sequences S such that the amino acid at position i of that sequence Si is a, divided by the sum over the sequence weights wS for all sequences S.
[Figure: one alignment column i containing G, D, D, T, S, each sequence carrying its own weight w.]
P(G|i) = 0.11
P(D|i) = 0.5
P(T|i) = 0.22
P(S|i) = 0.28
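As a small illustration of the weighted-sum definition, the sketch below computes P(a|i) for a single column. The column letters come from the slide, but the weights are hypothetical (the slide does not list them), and the function name `column_profile` is an assumption.

```python
from collections import defaultdict


def column_profile(residues, weights):
    """P(a|i): weight of sequences showing residue a, divided by the total weight."""
    totals = defaultdict(float)
    for aa, w in zip(residues, weights):
        totals[aa] += w
    norm = sum(weights)
    return {aa: t / norm for aa, t in totals.items()}


column = "GDDTS"                          # one alignment column, one letter per sequence
weights = [0.1, 0.3, 0.2, 0.2, 0.2]       # hypothetical sequence weights
p = column_profile(column, weights)
print(p)
```

With these made-up weights the two D sequences together contribute 0.5, and the four probabilities sum to 1, as any column of a profile must.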
PSSMs, log-likelihood ratios
LLR(a) = log( P(a|i) / P(a) )
P(a|i): probability of a in one column.
P(a): "background" likelihood of a overall (the whole database).
[Figure: position-specific scoring matrix, with the 20 amino acids along one axis: A C D E F G H I K L M N P Q R S T V W Y.]
Amino acids are not equally likely in nature. L, A and G are among the most common. W, C and M are among the rarest.
Pseudocounts account for unsampled data.
LLR(a) = log( P(a|i) / P(a) ) You can't take the log of zero.
The probability of seeing a in column i of a sequence alignment is never really zero.
So we add a small number of 'pseudocounts' ε.
LLR(a) = log( (P(a|i)+ε) / P(a) )
This LLR approaches log(ε/P(a)) in the limit as P(a|i) → 0.
log(ε/P(a)) is a negative number as long as ε < P(a).
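A short sketch of the pseudocount formula shows the effect numerically. The function name `llr`, the default ε = 0.01, the background value 0.05, and the use of natural logs (nats) are all assumptions chosen for the example.

```python
import math


def llr(p_column, p_background, eps=0.01):
    """Log-likelihood ratio with a flat pseudocount eps, in nats."""
    return math.log((p_column + eps) / p_background)


seen = llr(0.5, 0.05)      # residue observed often in this column
unseen = llr(0.0, 0.05)    # residue never observed: finite thanks to eps
print(seen, unseen)
```

The unobserved residue gets a finite negative score, log(0.01/0.05), instead of an undefined log(0).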
Smart pseudocounts: guessing the missing counts using BLOSUM
LLR(a) = log( (P(a|i) + εω) / P(a) )
where
$\omega = \sum_{b} P(b|i)\, S(a|b)$
and S(a|b) = probability of substitution b->a.
Why “smart”? Because here we use the observed counts to predict the unobserved counts, as opposed to blindly applying the same pseudocount to every amino acid.
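To make the ω term concrete, here is a toy sketch on a three-letter alphabet. Everything numeric in it is hypothetical: the substitution probabilities `subst` are invented stand-ins for values one would derive from BLOSUM, and the background frequencies, ε = 0.1, and the function name `smart_llr` are assumptions.

```python
import math


def smart_llr(profile_col, background, subst, eps=0.1):
    """LLR(a) = log((P(a|i) + eps * omega_a) / P(a)), omega_a = sum_b P(b|i) * S(a|b).

    subst[b][a] holds S(a|b), the probability that b is substituted by a.
    Assumes every omega_a is positive (true when all S(a|b) > 0).
    """
    scores = {}
    for a in background:
        omega = sum(p_b * subst[b][a] for b, p_b in profile_col.items())
        scores[a] = math.log((profile_col.get(a, 0.0) + eps * omega) / background[a])
    return scores


# Hypothetical substitution probabilities: D and E exchange easily, W rarely.
subst = {
    "D": {"D": 0.70, "E": 0.25, "W": 0.05},
    "E": {"D": 0.25, "E": 0.70, "W": 0.05},
    "W": {"D": 0.05, "E": 0.05, "W": 0.90},
}
background = {"D": 0.4, "E": 0.4, "W": 0.2}   # hypothetical background frequencies
column = {"D": 1.0}                           # only D was observed in this column
scores = smart_llr(column, background, subst)
print(scores)
```

Neither E nor W was observed, yet E is penalized less than W because a D column makes E a plausible unseen residue. A flat pseudocount would not distinguish the two on that basis.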
One way to visualize profiles: color matrix
Color = LLR. Blue = high negative values. Green = zero. Red = high positive values.
[Figure: color matrices for the motifs Proline Helix C-cap, Glycine Helix C-cap, and Glycine Helix C-cap, from the I-sites library (Bystroff et al., 1998).]
Logos
Height of letter is the LLR in bits, half-bits or nats.
» http://weblogo.berkeley.edu/logo.cgi
Sometimes you can detect a pattern only after merging and condensing a lot of data.
Alignments of transcription factor footprint sites
Many transcription factors bind DNA as homodimers, therefore they interact with... palindromes!
Protein sequence logo
What do you see? What might it mean?
How do you compare a sequence against a profile?
Sequence = KEMGFDHIIIHP
score = Σi LLR(ai)
The score is the sum, over positions i, of the log-likelihood ratio of the amino acid ai that the sequence has at that position.
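The sum is simple enough to show directly. The sketch below scores an ungapped sequence against a tiny three-position PSSM; the LLR values, the list-of-dicts layout, and the function name `profile_score` are all hypothetical and only serve to illustrate the formula.

```python
def profile_score(sequence, pssm):
    """Sum LLR(a_i) over positions; pssm[i][aa] is the LLR of aa at position i."""
    if len(sequence) != len(pssm):
        raise ValueError("sequence and profile must have the same length (no gaps here)")
    return sum(pssm[i][aa] for i, aa in enumerate(sequence))


# Hypothetical LLR values for a three-position profile.
toy_pssm = [
    {"K": 1.2, "E": -0.5, "M": -1.0},
    {"K": -0.3, "E": 0.9, "M": -1.0},
    {"K": -1.0, "E": -1.0, "M": 2.0},
]
good = profile_score("KEM", toy_pssm)   # matches the preferred residue at every position
poor = profile_score("EKM", toy_pssm)   # first two residues swapped
print(good, poor)
```

Swapping the first two residues lowers the score because each position is scored against its own column, which is what makes the matrix position specific.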
Aligning sequence to profile
S(i,j) = 0
do aa=1,20
  S(i,j) = S(i,j) + P(aa,i)*B(aa,s(j))
enddo
Aligning profile to profile
S(i,j) = 0
do aai=1,20
  do aaj=1,20
    S(i,j) = S(i,j) + P(aai,i)*P(aaj,j)*B(aai,aaj)
  enddo
enddo
Here P(aa,i) is P(aa|i) for profile 1 at position i, s(j) is the residue of sequence 2 at position j, and B is the BLOSUM score.
No need to normalize, since ∑aai ∑aaj P(aai|i)*P(aaj|j) = 1.
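The two loops translate almost line for line into Python. In this sketch the alphabet is cut down to three letters, the matrix `B` holds illustrative BLOSUM-style scores rather than a full table loaded from a library, and the function names are assumptions.

```python
def seq_profile_score(profile_col, residue, B):
    """S(i,j) = sum over aa of P(aa|i) * B(aa, s_j)."""
    return sum(p * B[aa][residue] for aa, p in profile_col.items())


def profile_profile_score(col_i, col_j, B):
    """S(i,j) = sum over aa_i, aa_j of P(aa_i|i) * P(aa_j|j) * B(aa_i, aa_j)."""
    return sum(
        p_i * p_j * B[aa_i][aa_j]
        for aa_i, p_i in col_i.items()
        for aa_j, p_j in col_j.items()
    )


# Illustrative BLOSUM-style scores on a three-letter alphabet.
B = {
    "D": {"D": 6, "E": 2, "W": -4},
    "E": {"D": 2, "E": 5, "W": -3},
    "W": {"D": -4, "E": -3, "W": 11},
}
col_i = {"D": 0.5, "E": 0.5}                    # profile 1 at position i
s_vs_p = seq_profile_score(col_i, "D", B)       # sequence 2 has D at position j
p_vs_p = profile_profile_score(col_i, {"D": 1.0}, B)
print(s_vs_p, p_vs_p)
```

A single residue behaves like a profile column with probability 1 on that residue, so the two functions agree in this case. That is a quick sanity check on the double loop.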
Reminder: Psi-BLAST is BLAST with profiles
Psi-BLAST searches the database iteratively.
(Cycle 1) Normal BLAST (with gaps).
(Cycle 2) (a) Construct a profile from the results of Cycle 1.
(b) Search the database using the profile.
(Cycle 3) (a) Construct a profile from the results of Cycle 2.
(b) Search the database using the profile.
And so on... (the user sets the number of cycles).
Psi-BLAST is much more sensitive than BLAST.
It is also more vulnerable to low-complexity sequence.
Review
• What is a sequence family? Superfamily?
• What is a profile?
• Why must sequences be weighted to calculate a profile?
• What is a distance matrix? What does it contain? How is it used to get sequence weights?
• What information is expressed in a sequence logo?
• What are pseudocounts? Why bother pseudocounting?