distances
Distances
A natural or ideal measure of distance between two sequences should have an evolutionary meaning.
One such measure may be the number of nucleotide substitutions that have accumulated in the two sequences since they have diverged from each other.
To derive a measure of distance, we need to make several simplifying assumptions regarding the probability of substitution of a nucleotide by another.
Jukes & Cantor one-parameter model
Assumption:
• Substitutions occur with equal probabilities among the four nucleotide types.
Kimura’s two-parameter model
Assumptions:
• The rate of transitional substitution at each nucleotide site is α per unit time.
• The rate of each type of transversional substitution is β per unit time.
NUMBER OF NUCLEOTIDE SUBSTITUTIONS BETWEEN TWO DNA SEQUENCES
After two nucleotide sequences diverge from each other, each of them will start accumulating nucleotide substitutions.
If two sequences of length N differ from each other at n sites, then the proportion of differences, n/N, is referred to as the degree of divergence or Hamming distance.
Degrees of divergence are usually expressed as percentages (n/N × 100%).
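The degree of divergence can be computed directly from an ungapped alignment. The following is a minimal Python sketch; the function name and the two ten-site sequences are hypothetical and chosen only for illustration.

```python
def degree_of_divergence(seq1: str, seq2: str) -> float:
    """Return n/N, the proportion of sites at which two aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    # n = number of sites at which the two sequences differ
    n = sum(a != b for a, b in zip(seq1, seq2))
    return n / len(seq1)


# Two made-up sequences that differ at 2 of 10 sites
p = degree_of_divergence("ACGTACGTAC", "ACGTTCGTAA")
print(f"degree of divergence: {p:.0%}")
```

Gapped positions are not handled here; they would have to be removed from both sequences before the comparison.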
The observed number of differences is likely to be smaller than the actual number of substitutions due to multiple hits at the same site.
13 mutations = 3 differences
Number of substitutions between two noncoding sequences
The one-parameter model
In this model, it is sufficient to consider only I(t), which is the probability that the nucleotide at a given site at time t is the same in both sequences.
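The number of substitutions per site since divergence, K, is estimated as
$$K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$
where p is the observed proportion of different nucleotides between the two sequences.
The approximate sampling variance of K is
$$V(K) \approx \frac{p - p^{2}}{L\left(1 - \frac{4}{3}p\right)^{2}}$$
L = number of sites compared in the ungapped alignment between the two sequences.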
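A minimal Python sketch of the one-parameter correction, assuming the standard Jukes and Cantor estimator K = -(3/4) ln(1 - 4p/3) and the variance V(K) = (p - p²) / [L (1 - 4p/3)²]. The function name and the input values are illustrative only.

```python
import math


def jukes_cantor(p: float, L: int) -> tuple[float, float]:
    """Return (K, V(K)) under the one-parameter model.

    p: observed proportion of different nucleotides
    L: number of sites compared in the ungapped alignment
    """
    if not 0 <= p < 0.75:
        # The logarithm is undefined once p reaches 3/4 (saturation)
        raise ValueError("p must lie in [0, 0.75)")
    w = 1 - 4 * p / 3
    K = -0.75 * math.log(w)
    var = (p - p * p) / (L * w * w)
    return K, var


# Hypothetical input: 20% observed differences over 100 sites
K, var = jukes_cantor(0.2, 100)
print(f"K = {K:.4f}, V(K) = {var:.5f}")
```

The estimate K is always at least as large as p, which reflects the correction for multiple hits at the same site.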
The two-parameter model
The differences between two sequences are classified into transitions and transversions.
P = proportion of transitional differences
Q = proportion of transversional differences
ATCGGACCCG
P = 0.2, Q = 0.2
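$$V(K) \approx \frac{1}{L}\left[ P\left(\frac{1}{1-2P-Q}\right)^{2} + Q\left(\frac{1}{2-4P-2Q} + \frac{1}{2-4Q}\right)^{2} - \left(\frac{P}{1-2P-Q} + \frac{Q}{2-4P-2Q} + \frac{Q}{2-4Q}\right)^{2} \right]$$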
Numerical example (2P-model)
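The original numerical example did not survive in this transcript. As a stand-in, here is a minimal Python sketch that assumes Kimura's standard estimator K = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)] together with the variance formula for V(K). It uses P = 0.2 and Q = 0.2 from the ten-site example; the function name and the choice L = 10 are assumptions.

```python
import math


def kimura_2p(P: float, Q: float, L: int) -> tuple[float, float]:
    """Return (K, V(K)) under the two-parameter model.

    P: proportion of transitional differences
    Q: proportion of transversional differences
    L: number of sites compared
    """
    w1 = 1 - 2 * P - Q
    w2 = 1 - 2 * Q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("P and Q are too large for the logarithms to be defined")
    K = 0.5 * math.log(1 / w1) + 0.25 * math.log(1 / w2)
    a = 1 / w1                      # coefficient multiplying P in V(K)
    b = 0.5 * (1 / w1 + 1 / w2)     # equals 1/(2-4P-2Q) + 1/(2-4Q)
    var = (a * a * P + b * b * Q - (a * P + b * Q) ** 2) / L
    return K, var


K, var = kimura_2p(P=0.2, Q=0.2, L=10)
print(f"K = {K:.4f}, V(K) = {var:.4f}")
```

With only ten sites the variance is large relative to K, so the estimate should be read as an illustration of the arithmetic rather than as a meaningful distance.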
- Substitution schemes with more than two parameters.
- Parameter-free substitution schemes.
Number of substitutions between two protein-coding genes
$$K_S = \frac{\text{Number of synonymous substitutions}}{\text{Number of synonymous sites}}$$
$$K_A = \frac{\text{Number of nonsynonymous substitutions}}{\text{Number of nonsynonymous sites}}$$
Difficulties with denominator:
1. The classification of a site changes with time. For example, the third position of CGG (Arg) is synonymous. However, if the first position changes to T, then the third position of the resulting codon, TGG (Trp), becomes nonsynonymous.
2. Many sites are neither completely synonymous nor completely nonsynonymous. For example, a transition in the third position of GAT (Asp) will be synonymous, while a transversion to either GAG or GAA will alter the amino acid.
Difficulties with numerator:
1. The classification of the change depends on the order in which the substitutions occurred.
2. Transitions occur with different frequencies than transversions.
3. The type of substitution depends on the mutation. Transitions result more frequently in synonymous substitutions than transversions.
Miyata & Yasunaga (1980) and Nei & Gojobori (1986) method
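1st | U        | C       | A        | G        | 3rd
U   | UUU Phe  | UCU Ser | UAU Tyr  | UGU Cys  | U
U   | UUC Phe  | UCC Ser | UAC Tyr  | UGC Cys  | C
U   | UUA Leu  | UCA Ser | UAA Stop | UGA Stop | A
U   | UUG Leu  | UCG Ser | UAG Stop | UGG Trp  | G
C   | CUU Leu  | CCU Pro | CAU His  | CGU Arg  | U
C   | CUC Leu  | CCC Pro | CAC His  | CGC Arg  | C
C   | CUA Leu  | CCA Pro | CAA Gln  | CGA Arg  | A
C   | CUG Leu  | CCG Pro | CAG Gln  | CGG Arg  | G
A   | AUU Ile  | ACU Thr | AAU Asn  | AGU Ser  | U
A   | AUC Ile  | ACC Thr | AAC Asn  | AGC Ser  | C
A   | AUA Ile  | ACA Thr | AAA Lys  | AGA Arg  | A
A   | AUG Met  | ACG Thr | AAG Lys  | AGG Arg  | G
G   | GUU Val  | GCU Ala | GAU Asp  | GGU Gly  | U
G   | GUC Val  | GCC Ala | GAC Asp  | GGC Gly  | C
G   | GUA Val  | GCA Ala | GAA Glu  | GGA Gly  | A
G   | GUG Val  | GCG Ala | GAG Glu  | GGG Gly  | G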
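Step 1: Classify nucleotide sites into nondegenerate, twofold degenerate, and fourfold degenerate sites
L0 = number of nondegenerate sites
L2 = number of twofold degenerate sites
L4 = number of fourfold degenerate sites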
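The classification step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the exact procedure of any published program: it uses the standard genetic code written with T instead of U, and it counts a site as nondegenerate when none of the three possible changes is synonymous, fourfold degenerate when all three are, and twofold degenerate otherwise. Lumping the threefold third positions of the Ile codons with the twofold class is an assumption. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def degeneracy(codon: str, pos: int) -> int:
    """Return 0, 2 or 4 for the site at 0-based index pos of codon."""
    aa = GENETIC_CODE[codon]
    synonymous = 0
    for base in BASES:
        if base == codon[pos]:
            continue
        mutant = codon[:pos] + base + codon[pos + 1:]
        if GENETIC_CODE[mutant] == aa:
            synonymous += 1
    if synonymous == 0:
        return 0
    if synonymous == 3:
        return 4
    return 2  # one or two synonymous changes (assumption: threefold counted as twofold)


def count_site_classes(coding_seq: str) -> tuple[int, int, int]:
    """Return (L0, L2, L4) for an in-frame coding sequence without stop codons."""
    counts = {0: 0, 2: 0, 4: 0}
    for i in range(0, len(coding_seq) - 2, 3):
        codon = coding_seq[i:i + 3]
        for pos in range(3):
            counts[degeneracy(codon, pos)] += 1
    return counts[0], counts[2], counts[4]


print(count_site_classes("CGGGAT"))
```

The sketch reproduces the examples in the text: the third position of CGG is fourfold degenerate, the third position of TGG is nondegenerate, and the third position of GAT is twofold degenerate.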
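$$K_S = \frac{L_2 A_2 + L_4 A_4}{L_2 + L_4} + B_4$$
$$V(K_S) = \frac{L_2^{2}\,V(A_2) + L_4^{2}\,V(A_4)}{(L_2 + L_4)^{2}} + V(B_4) - \frac{2 b_4 Q_4 \left[a_4 P_4 - c_4 (1 - Q_4)\right]}{L_2 + L_4}$$
$$K_A = A_0 + \frac{L_0 B_0 + L_2 B_2}{L_0 + L_2}$$
$$V(K_A) = V(A_0) + \frac{L_0^{2}\,V(B_0) + L_2^{2}\,V(B_2)}{(L_0 + L_2)^{2}} - \frac{2 b_0 Q_0 \left[a_0 P_0 - c_0 (1 - Q_0)\right]}{L_0 + L_2}$$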
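The point estimates combine the site counts with per-class substitution numbers: K_S = (L2·A2 + L4·A4)/(L2 + L4) + B4 and K_A = A0 + (L0·B0 + L2·B2)/(L0 + L2). The Python sketch below only does this arithmetic. It assumes that A_i and B_i denote the numbers of transitional and transversional substitutions per site of degeneracy class i, which the surviving text does not define. The function name and all input values are invented for illustration.

```python
def ks_ka(L0: float, L2: float, L4: float,
          A0: float, A2: float, A4: float,
          B0: float, B2: float, B4: float) -> tuple[float, float]:
    """Return (K_S, K_A) from site counts and per-class substitution numbers."""
    # Synonymous: transitions at twofold and fourfold sites, plus transversions at fourfold sites
    K_S = (L2 * A2 + L4 * A4) / (L2 + L4) + B4
    # Nonsynonymous: transitions at nondegenerate sites, plus transversions at nondegenerate and twofold sites
    K_A = A0 + (L0 * B0 + L2 * B2) / (L0 + L2)
    return K_S, K_A


# Hypothetical values, not taken from any real data set
K_S, K_A = ks_ka(L0=60, L2=20, L4=20,
                 A0=0.01, A2=0.10, A4=0.20,
                 B0=0.02, B2=0.03, B4=0.05)
print(f"K_S = {K_S:.4f}, K_A = {K_A:.4f}")
```

The variances are omitted from the sketch because they also require the quantities a_i, b_i and c_i, which are not defined in the surviving text.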
Number of Amino-Acid Replacements between Two Proteins
• The observed proportion of different amino acids between the two sequences (p) is
p = n /L
• n = number of amino acid differences between the two sequences
• L = length of the aligned sequences.
The Poisson model is used to convert p into the number of amino acid replacements per site between the two sequences (d):
d = –ln(1 – p)
The variance of d is estimated as
V(d) = p / [L (1 – p)]
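The Poisson correction and its variance, V(d) = p / [L (1 – p)], can be sketched in a few lines of Python. The function name and the input values are hypothetical.

```python
import math


def poisson_distance(n: int, L: int) -> tuple[float, float]:
    """Return (d, V(d)) for n amino acid differences over L aligned sites."""
    p = n / L
    if p >= 1:
        raise ValueError("p must be below 1 for the logarithm to be defined")
    d = -math.log(1 - p)
    var = p / (L * (1 - p))
    return d, var


# Hypothetical input: 20 differences over 100 aligned residues
d, var = poisson_distance(20, 100)
print(f"d = {d:.4f}, V(d) = {var:.4f}")
```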
How do you detect adaptive evolution at the genetic level?
Theoretical Expectations
Deleterious mutations
Advantageous mutations
Neutral mutations
Overdominant mutations