Adaptive Evolution on Genes and Genomes
Hernán J. Dopazo
Evolutionary Genomics Unit
Bioinformatics & Genomics Department
Centro de Investigación Príncipe Felipe
Valencia, Spain
http://hdopazo.bioinfo.cipf.es/
Genomes & Systems
Universitat Pompeu Fabra
Barcelona, Spain
Sunday, 14 November 2010
2
Natural Selection
The mechanisms by which the relative frequencies of genotypes change according to their relative fitnesses in the population
Positive Selection
Purifying Selection
3
Natural Selection
Natural selection is a process of pervasive importance in the biological world, which includes our own species, and on which that species is utterly dependent. Progress in evolutionary biology and its applications is perhaps most obviously relevant to medical and environmental issues, but there is no aspect of human life for which an understanding of evolution is not a vital necessity.
George C. Williams, Plan and Purpose in Nature, 1996
4
Positive Selection
The mechanism whereby newly produced mutants have higher fitnesses than the average in the population, and the frequencies of the mutants increase in the following generations
The biological consequence of such a mechanism is adaptation
5
Purifying Selection
The mechanism whereby newly produced mutants have lower fitnesses than the average in the population, and the frequencies of the mutants decrease in the following generations
The biological consequence of such a mechanism is the maintenance of adaptation in the population
6
Neutral Evolution
The great majority of evolutionary changes at the molecular level are caused by random drift of selectively neutral or nearly neutral mutations
The relevant consequence of such a mechanism is the fixation of mutations at a constant rate in the population
7
Measuring natural selection at the molecular level
Selective pressure at the protein level can be measured as: $\omega = d_N/d_S$
where $d_N$ is the number of nonsynonymous substitutions per nonsynonymous site and $d_S$ is the number of synonymous substitutions per synonymous site between two protein-coding sequences
If nonsynonymous mutations are favoured by positive selection, they will be fixed at a faster rate than synonymous mutations, so $\omega > 1$
If nonsynonymous mutations are deleterious (purifying selection), synonymous mutations will be fixed at a faster rate, so $\omega < 1$
If nonsynonymous mutations have no effect on fitness (neutral evolution), synonymous and nonsynonymous mutations will be fixed at equal rates, so $\omega = 1$
11
Sequence pairwise comparison: Counting methods
Suppose a gene has 300 codons and we observe 3 synonymous and 3 nonsynonymous differences between two sequences.
Is $\omega = 1$?
(Figure: alignment of the H and Ch sequences, codons 1 to 300, with three differences marked S and three marked N.)
If mutations from any one nucleotide to any other occur at the same rate, we expect 25.5% of mutations to be synonymous and 74.5% to be nonsynonymous (a default consequence of the genetic code table).
We do not expect synonymous and nonsynonymous mutations in equal proportions even if there is no selection at the protein level.
The total numbers of synonymous (S) and nonsynonymous (N) sites are probably close to:
$$S = 300 \times 3 \times 0.255 = 229.5; \qquad N = 300 \times 3 \times 0.745 = 670.5$$
Therefore,
$$d_S = \frac{3}{229.5} = 0.0131; \qquad d_N = \frac{3}{670.5} = 0.0044; \qquad \omega = \frac{d_N}{d_S} = \frac{0.0044}{0.0131} = 0.336$$
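The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name and its arguments are illustrative, and the 25.5% synonymous fraction is the equal-rates figure quoted above. Computed without rounding the intermediate values, the ratio comes out near 0.342; the 0.336 on the slide follows from dividing the rounded values 0.0044 and 0.0131.

```python
def counting_example(n_codons=300, syn_diffs=3, nonsyn_diffs=3, syn_fraction=0.255):
    """Crude dN, dS and omega from difference counts (no multiple-hit correction)."""
    total_sites = 3 * n_codons                 # nucleotide sites in the gene
    S = total_sites * syn_fraction             # expected synonymous sites
    N = total_sites * (1.0 - syn_fraction)     # expected nonsynonymous sites
    dS = syn_diffs / S                         # synonymous differences per synonymous site
    dN = nonsyn_diffs / N                      # nonsynonymous differences per nonsynonymous site
    return S, N, dS, dN, dN / dS

S, N, dS, dN, omega = counting_example()
print(f"S = {S:.1f}, N = {N:.1f}")
print(f"dS = {dS:.4f}, dN = {dN:.4f}, omega = {omega:.3f}")
```

Equal numbers of synonymous and nonsynonymous differences therefore point to purifying selection, not neutrality, because there are roughly three times as many nonsynonymous sites.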
13
Counting methods (3 steps)
1. Count the total number of N and S nucleotide sites
Complicated by factors such as ts/tv bias and base/codon frequency bias!
H   CCA CCG AAA ACC … TCA GTA CGG AAT
Ch  GCA CCG AAA AGC … ACA GTA GGG AAC
    1   2   3   4   … 697 698 699 700
$$S = 700 \times 3 \times 0.366 = 768.6; \qquad N = 700 \times 3 \times 0.634 = 1331.4$$
2. Count the synonymous (S) and nonsynonymous (N) differences
H   CCA CCG AAA ACC … TCA GTA CGG AAT
Ch  GCA CCG AAA AGC … ACA GTA GGG CAC
    1   2   3   4   … 697 698 699 700
(In the original figure each differing codon is labelled S or N.)
This is straightforward if the two compared codons differ at one codon position only. When they differ at 2 or 3 codon positions, there are 2 or 6 pathways from one codon to the other. The multiple pathways may involve different numbers of synonymous and nonsynonymous differences and should ideally be weighted according to their likelihood of occurrence. Most counting methods use equal weighting.
3. Apply a correction for multiple substitutions at the same site
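The three steps can be sketched in Python for the simplest case. The code below is an illustration in the spirit of the equal-weighting counting methods, not a reimplementation of any published program: the function names are made up for this example, changes to stop codons are counted as nonsynonymous (implementations differ on this point), only codon pairs differing at a single position are classified, and the multiple-hit correction is the Jukes-Cantor formula.

```python
import math
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_sites(codon):
    """Step 1: synonymous sites in one codon, all 9 single changes weighted equally."""
    s = 0.0
    for pos in range(3):
        syn = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == CODE[codon]:   # stop codons never match, so they count as N
                syn += 1
        s += syn / 3.0
    return s

def count_sites(codons):
    """Total synonymous (S) and nonsynonymous (N) sites in a list of codons."""
    S = sum(synonymous_sites(c) for c in codons)
    return S, 3 * len(codons) - S

def classify_difference(c1, c2):
    """Step 2: 'S' or 'N' for codons differing at exactly one position, else None."""
    if sum(a != b for a, b in zip(c1, c2)) != 1:
        return None
    return "S" if CODE[c1] == CODE[c2] else "N"

def jukes_cantor(p):
    """Step 3: correct a proportion of differences p for multiple hits."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# First four codons of the H and Ch sequences shown above
h = ["CCA", "CCG", "AAA", "ACC"]
ch = ["GCA", "CCG", "AAA", "AGC"]
S_h, N_h = count_sites(h)
labels = [classify_difference(a, b) for a, b in zip(h, ch)]
print(S_h, N_h, labels)
```

In practice the site counts are usually averaged over the two sequences, and codon pairs that differ at two or three positions need the pathway weighting described in step 2.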
16
Counting methods:
The method of Miyata and Yasunaga (1980, J Mol Evol, 16(1):23) and its simplified version (Nei and Gojobori, 1986, Mol Biol Evol, 3(5):418) are based on the nucleotide substitution model of Jukes and Cantor (1969) and ignore the ts/tv bias and the base/codon frequency bias.
Since transitions (ts) are more likely to be synonymous than transversions (tv) at the third codon position, ignoring the ts/tv rate bias underestimates the number of S sites and overestimates the number of N sites. This effect is well known, and several methods account for the ts/tv ratio
(Li et al., 1985, Mol Biol Evol, 2(2):150; Li, 1993, J Mol Evol, 36(1):96; Pamilo and Bianchi, 1993, Mol Biol Evol, 10(2):271; Ina, 1995, J Mol Evol, 40(2):190)
Biased base/codon frequencies can have devastating effects on the estimation of dN and dS. Qualitatively different conclusions were reached for nuclear genes from mammals and Drosophila depending on whether codon usage bias was accommodated.
A counting method incorporating both the ts/tv bias and the base/codon frequency bias was implemented by Yang and Nielsen (2000, Mol Biol Evol, 17(1):32).
Many, if not all, of these methods are incorporated in the codeml program (PAML).
17
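Codon Model
Markov model of codon substitution (61 states, the sense codons) (Goldman & Yang, 1994, Mol Biol Evol, 11, Pp 725; Muse & Gaut, 1994, Mol Biol Evol, 11, Pp 715)
The instantaneous substitution rate from codon i to codon j ($i \neq j$):
$$q_{ij} = \begin{cases} 0, & \text{if } i \text{ and } j \text{ differ at 2 or 3 codon positions,} \\ \mu\,\pi_j, & \text{if } i \text{ and } j \text{ differ by a synonymous transversion,} \\ \mu\,\kappa\,\pi_j, & \text{if } i \text{ and } j \text{ differ by a synonymous transition,} \\ \mu\,\omega\,\pi_j, & \text{if } i \text{ and } j \text{ differ by a nonsynonymous transversion,} \\ \mu\,\kappa\,\omega\,\pi_j, & \text{if } i \text{ and } j \text{ differ by a nonsynonymous transition.} \end{cases}$$
where $\mu$ is the substitution rate, $\pi_j$ is the frequency of codon j, $\kappa$ is the ts/tv rate ratio and $\omega$ is the dN/dS rate ratio.
Codon instantaneous rate matrix:
$$Q = \{q_{ij}\} = \begin{pmatrix} q_{1,1} & q_{1,2} & \cdots & q_{1,60} & q_{1,61} \\ q_{2,1} & q_{2,2} & \cdots & q_{2,60} & q_{2,61} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ q_{60,1} & q_{60,2} & \cdots & q_{60,60} & q_{60,61} \\ q_{61,1} & q_{61,2} & \cdots & q_{61,60} & q_{61,61} \end{pmatrix}$$
(Figure: transitions (ts) occur within the purines, A ↔ G, and within the pyrimidines, C ↔ T; changes between a purine and a pyrimidine are transversions (tv).)
The model accounts for ts/tv bias, unequal synonymous and nonsynonymous substitution rates, and biased base/codon frequencies.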
18
Codon Model
For example, consider the substitution rates to codon CTG (Leu). We have:
$q_{TTT,CTG} = 0$, since TTT and CTG differ at more than one position;
$q_{CCG,CTG} = \mu\,\kappa\,\omega\,\pi_{CTG}$, since the CCG (Pro) → CTG (Leu) change is a nonsynonymous transition;
$q_{GTG,CTG} = \mu\,\omega\,\pi_{CTG}$, since the GTG (Val) → CTG (Leu) change is a nonsynonymous transversion;
$q_{TTG,CTG} = \mu\,\kappa\,\pi_{CTG}$, since the TTG (Leu) → CTG (Leu) change is a synonymous transition;
$q_{CTC,CTG} = \mu\,\pi_{CTG}$, since the CTC (Leu) → CTG (Leu) change is a synonymous transversion.
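The rate rules of the codon model translate almost line by line into code. The sketch below is a minimal illustration, assuming the standard genetic code and setting each diagonal element so that its row sums to zero; the function names and the parameter values used at the end (equal codon frequencies, $\kappa = 2$, $\omega = 0.5$) are hypothetical and chosen only for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c in CODE if CODE[c] != "*"]        # the 61 sense codons
TRANSITIONS = {frozenset("AG"), frozenset("CT")}   # purine-purine, pyrimidine-pyrimidine

def q_ij(i, j, kappa, omega, pi_j, mu=1.0):
    """Instantaneous rate from codon i to codon j (off-diagonal elements)."""
    diff = [(a, b) for a, b in zip(i, j) if a != b]
    if len(diff) != 1:              # identical codons, or 2-3 positions differ
        return 0.0
    rate = mu * pi_j
    if frozenset(diff[0]) in TRANSITIONS:
        rate *= kappa               # transition
    if CODE[i] != CODE[j]:
        rate *= omega               # nonsynonymous change
    return rate

def rate_matrix(kappa, omega, pi):
    """61 x 61 matrix Q as a list of rows; each diagonal makes its row sum to zero."""
    Q = []
    for row_index, i in enumerate(SENSE):
        row = [q_ij(i, j, kappa, omega, pi[j]) for j in SENSE]
        row[row_index] = -sum(row)
        Q.append(row)
    return Q

# Hypothetical parameter values, for illustration only
pi = {c: 1.0 / 61 for c in SENSE}
kappa, omega = 2.0, 0.5
Q = rate_matrix(kappa, omega, pi)
for source in ["TTT", "CCG", "GTG", "TTG", "CTC"]:
    print(source, "-> CTG", q_ij(source, "CTG", kappa, omega, pi["CTG"]))
```

The five printed rates correspond to the five kinds of change into CTG (Leu): no single-step change, nonsynonymous transition, nonsynonymous transversion, synonymous transition and synonymous transversion.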
19
Codon Model
Molecular sequence data do not allow separate estimation of rate ($\mu$) and time ($t$); only their product ($\mu t$) can be identified.
We thus fix the rate $\mu$ such that the expected number of nucleotide substitutions per codon is one.
This scaling means that time ($t$) is measured by genetic distance: the expected number of (nucleotide) substitutions per codon.
The transition probability matrix over time $t$ is:
$$P(t) = \{p_{ij}(t)\} = e^{Qt}$$
(Figure: the change CCC → CTC, with probability $p_{CCC,CTC}(t)$ shown for $t = 1$.)
Lastly, the model is time-reversible. This means:
$$\pi_i\, p_{ij}(t) = \pi_j\, p_{ji}(t)$$
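Written out, the scaling and the matrix exponential take the following form. This is a sketch in the notation of the codon model; the normalisation shown is the usual one for this class of models and is stated here as an assumption about what the slide intends, with the diagonal elements defined so that each row of $Q$ sums to zero.

```latex
% Diagonal elements: each row of Q sums to zero
q_{ii} = -\sum_{j \neq i} q_{ij}

% Scaling: \mu is chosen so that the average substitution rate at equilibrium is one
-\sum_{i=1}^{61} \pi_i\, q_{ii} \;=\; \sum_{i=1}^{61} \sum_{j \neq i} \pi_i\, q_{ij} \;=\; 1

% Transition probabilities as a matrix exponential
P(t) \;=\; e^{Qt} \;=\; \sum_{n=0}^{\infty} \frac{(Qt)^n}{n!},
\qquad P(0) = I,
\qquad \sum_{j=1}^{61} p_{ij}(t) = 1
```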
20
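Pairwise Comparison: Maximum Likelihood (ML) estimation of $\omega$
H   CCC CCG … ACC
Ch  CTC CCG … AGC
    h=1 h=2 … h=n
$x_h = (x_1, x_2)$ is the pair of codons at site h; there are n codon sites.
(Figure: the sequences $x_1$ and $x_2$ joined through the ancestral codon k by branches $t_0$ and $t_1$, or equivalently by a single branch $t = t_0 + t_1$; for the first site, CCC → CTC with probability $p_{CCC,CTC}(t)$.)
The probability of site h, here $x_h$ = (CCC, CTC), is:
$$f(x_h) = \sum_{k=1}^{61} \pi_k\, p_{k,CCC}(t_0)\, p_{k,CTC}(t_1) = \pi_{CCC}\, p_{CCC,CTC}(t)$$
k = ancestral codon, unknown!
$t = t_0 + t_1$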
21
Pairwise Comparison: Maximum Likelihood (ML) estimation of $\omega$
$$f(x_h) = \pi_{x_1}\, p_{x_1,x_2}(t)$$
Parameters in the model: the sequence divergence ($t$), the transition/transversion rate ratio ($\kappa$), the nonsynonymous/synonymous rate ratio ($\omega$), and the codon frequencies ($\pi_j$)
...estimated from the observed data (base frequencies, F3x4, or codon frequencies, F61)
The log-likelihood function for these sequences is then given by:
$$l(t, \kappa, \omega) = \sum_{h=1}^{n} \log f(x_h)$$
22
Maximum Likelihood (ML) estimation of $\omega$
• Since there is no analytic solution, a numerical optimization algorithm is used to maximize $l$:
$$l(t, \kappa, \omega) = \sum_{h=1}^{n} \log f(x_h)$$
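To show what numerical maximization of a log-likelihood looks like, the sketch below swaps in a much simpler model than the codon model: the Jukes-Cantor nucleotide model for two sequences, whose only parameter is the distance d. It is not what codeml does. The example is chosen because the Jukes-Cantor estimate also has a closed form, so the numerical answer can be checked. The golden-section search, the function names and the data (6 differences in 900 nucleotides, echoing the 300-codon example) are all assumptions made for illustration.

```python
import math

def jc_loglik(d, n_sites, n_diff):
    """Log-likelihood of distance d under Jukes-Cantor for a pairwise alignment."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e      # probability a site keeps the same nucleotide
    p_diff = 0.25 - 0.25 * e      # probability of one specific different nucleotide
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - g * (b - a)
    d = a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d                 # maximum lies in [a, d]
        else:
            a = c                 # maximum lies in [c, b]
        c = b - g * (b - a)
        d = a + g * (b - a)
    return (a + b) / 2.0

n_sites, n_diff = 900, 6
d_hat = golden_section_max(lambda d: jc_loglik(d, n_sites, n_diff), 1e-6, 2.0)
d_closed = -0.75 * math.log(1.0 - 4.0 * n_diff / (3.0 * n_sites))
print(d_hat, d_closed)
```

The codon model has several parameters ($t$, $\kappa$, $\omega$), so a multidimensional optimizer is needed there, but the principle is the same: evaluate $l$ repeatedly and move towards higher values until the improvement is negligible.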
23
Comparing pairwise methods
(Figure: the PAML programs codeml and YN00 compared; panel labels: codon bias, transition bias.)
24
Phylogenetic (ML) estimation of $\omega$
Pairwise methods have low power to detect positive selection.
They average over all sites and over the whole evolutionary distance separating the sequences.
Only 17 out of 3600 genes showed evidence of positive selection (Endo et al. 1996, Mol Biol Evol, 13, Pp. 685)
Power is improved if selective pressure is allowed to vary over sites or branches.
Increasing the complexity of the codon model in this way requires that the likelihood be calculated for multiple sequences on a phylogeny.
25
Likelihood calculation for multiple sequences on a phylogeny
Given an unrooted tree with N = 4 species, $x_h = (x_1, x_2, x_3, x_4)$
(Figure: unrooted tree in which $x_1$ and $x_2$ join ancestral node k through branches $t_1$ and $t_2$, $x_3$ and $x_4$ join ancestral node g through branches $t_3$ and $t_4$, and k and g are joined by branch $t_0$; N − 2 ancestral nodes.)
The probability of observing the data at codon h is:
$$f(x_h) = \sum_k \sum_g \pi_k\, p_{k,x_1}(t_1)\, p_{k,x_2}(t_2)\, p_{k,g}(t_0)\, p_{g,x_3}(t_3)\, p_{g,x_4}(t_4)$$
The probability of the data at each site is the sum over the $61^{(N-2)}$ possible combinations of codons at the ancestral nodes.
The log-likelihood function is the sum over all n codon sites:
$$l(\mathbf{t}, \kappa, \omega) = \sum_{h=1}^{n} \log f(x_h)$$
where $\mathbf{t}$ holds the 2N − 3 branch lengths.
We assumed a single value of $\omega$ for all the sequences in the tree.
26
Modelling variable selective pressure among lineages. Yang, Z. 1998. Mol. Biol. Evol. 15:568.
Adaptive evolution is most likely to occur in an episodic fashion.
$x_h = (x_1, x_2, x_3, x_4)$
(Figure: the four-species tree with a rate ratio assigned to each branch: $t_1, \omega_0$; $t_2, \omega_0$; $t_0, \omega_0$; $t_3, \omega_1$; $t_4, \omega_1$.)
The transition probabilities for the 2 sets of branches are calculated from different rate matrices (Q) generated by using different $\omega$ ratios.
The probability of observing the data at codon h:
$$f(x_h) = \sum_k \sum_g \pi_k\, p_{k,x_1}(t_1, \omega_0)\, p_{k,x_2}(t_2, \omega_0)\, p_{k,g}(t_0, \omega_0)\, p_{g,x_3}(t_3, \omega_1)\, p_{g,x_4}(t_4, \omega_1)$$
The log-likelihood function is the sum over all n codon sites:
$$l(\mathbf{t}, \kappa, \omega_0, \omega_1) = \sum_{h=1}^{n} \log f(x_h)$$
where $\mathbf{t}$ holds the 2N − 3 branch lengths.
Two $\omega$ values are estimated for different lineages in the tree, averaging over all sites!
The most general model, called the "free-ratio model", specifies an independent $\omega$ for each branch of the tree.
28
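Modelling variable selective pressure among sites. Yang, Z. and R. Nielsen, 2002. Mol. Biol. Evol. 19: 908-917.
The most used approach is to use a statistical distribution to model the random variation in $\omega$ over sites (e.g. beta, gamma distribution, etc.)
Codeml has 13 alternative models available. Continuous distributions are approximated by K discrete categories:
K classes, with different ratios $\omega_i$ and proportions $p_i$
If K = 2, $\omega_i = (\omega_0, \omega_1)$ with proportions $p_0$ and $p_1$
(Figure: the two site classes, labelled $\omega_0 = 0$ and $0 < \omega_1 < 1$, on the four-species tree with tips $x_1$ to $x_4$, ancestral nodes k and g, and branches $t_0$ to $t_4$.)
The probability of the data at site h, given that the site belongs to class i (transition probabilities calculated with $\omega = \omega_i$):
$$f(x_h \mid \omega_i) = \sum_k \sum_g \pi_k\, p_{k,x_1}(t_1)\, p_{k,x_2}(t_2)\, p_{k,g}(t_0)\, p_{g,x_3}(t_3)\, p_{g,x_4}(t_4)$$
Since we do not know to which class each site h belongs, we sum over both classes:
$$f(x_h) = \sum_{i=0}^{1} p_i\, f(x_h \mid \omega_i)$$
$$l = \sum_{h=1}^{n} \log f(x_h)$$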
29
Models of variable $\omega$ ratio among sites
30
Bayes Empirical Bayes (BEB)
Yang Z., W. Wong & R. Nielsen, 2005, Mol Biol Evol 22, Pp 1107
After ML estimation of the model's parameters, an empirical Bayes approach is used to infer which class each site most likely belongs to.
The posterior probability that site h with data $x_h$ is from site class k is:
$$\mathrm{prob}(\omega_k \mid x_h) = \frac{p_k\, f(x_h \mid \omega_k)}{f(x_h)}$$
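The posterior formula is a direct application of Bayes' rule and can be sketched as follows. The class proportions and per-class site likelihoods below are hypothetical numbers chosen for illustration; in a real analysis they come from the fitted model. The sketch implements only the formula shown here, with parameters fixed at their estimates; the BEB procedure in codeml additionally accounts for uncertainty in those estimates.

```python
def posterior_site_classes(proportions, site_likelihoods):
    """Posterior probability of each site class for one site.

    proportions[k]      -- p_k, proportion of sites in class k
    site_likelihoods[k] -- f(x_h | omega_k), likelihood of the site data under class k
    """
    joint = [p * f for p, f in zip(proportions, site_likelihoods)]
    total = sum(joint)                 # f(x_h), the mixture likelihood of the site
    return [j / total for j in joint]

# Hypothetical values for three classes (purifying, neutral, positive)
p = [0.7, 0.2, 0.1]
f = [1e-4, 2e-4, 8e-4]
post = posterior_site_classes(p, f)
print([round(x, 3) for x in post])
```

A site is usually reported as positively selected only when the posterior probability of the class with $\omega > 1$ exceeds a high cutoff such as 0.95 or 0.99.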
31
Bayes empirical Bayes results
(Figure legend: Neutral, Positive, Purifying)
32
Branch-site models
Yang, Z. and R. Nielsen, 2002. Mol Biol Evol 19, Pp. 908
Useful to test whether an adaptive process occurred at a subset of sites for a limited period of time.
Useful to test adaptation after gene duplication.
k = 3 site classes (0, 1, 2) on foreground and background branches:
Positive selection: $\omega_2 > 1$ on the foreground branch (proportions $p_{2a}$, $p_{2b}$)
Neutrality: $\omega_1 = 1$ (proportion $p_1$)
Purifying selection: $0 < \omega_0 < 1$ (proportion $p_0$)
Test I: M1a (neutral) vs A (Positive selection) models
33
Likelihood ratio test … nested models comparison
$$2\Delta l = 2\,(l_A - l_{M1a}) > \chi^2_{\alpha = 0.05,\ df = 2}$$
Accept the alternative hypothesis of positive selection!
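The test can be sketched in a few lines. For 2 degrees of freedom the chi-square tail probability has the closed form $e^{-x/2}$, so no statistics library is needed; the function name and the two log-likelihood values below are hypothetical and serve only as an illustration.

```python
import math

def lrt_df2(lnL_null, lnL_alt, alpha=0.05):
    """Likelihood ratio test for nested models differing by 2 parameters."""
    stat = 2.0 * (lnL_alt - lnL_null)                  # 2 * delta l
    p_value = math.exp(-stat / 2.0) if stat > 0 else 1.0   # chi-square tail, df = 2
    critical = -2.0 * math.log(alpha)                  # about 5.99 for alpha = 0.05
    return stat, p_value, critical, stat > critical

# Hypothetical log-likelihoods for the null (M1a) and alternative (A) models
stat, p_value, critical, reject_null = lrt_df2(-1000.0, -995.0)
print(stat, p_value, critical, reject_null)
```

With these made-up values the statistic is 10, well above the critical value, so the null model would be rejected in favour of the model that allows positive selection.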
34
Thanks!
Acknowledgements
Chapter 5. Bielawski J. & Z. Yang, 2005
Micostrium Vulgaris
Phylum: Chordata
Subphylum: Vertebrata
Class: Mammalia
36
Topics
Human Evolution
Positive selection / relaxation - PLoS Comp. Biol., 2006
Brain-specific genes - submitted, Mol Biol Evol
Human Disease
Human cSNPs functional prediction - JMB, 2006
Pupas Web server - NAR, 2006
37
Ancient Positive Selection on Human-Chimp Genome
Comparative Genomics Data
Main Publications
38
Our work… why?
• Main Questions
Which is the full set of genes and functions that evolve outside of the molecular clock hypothesis?
Which is the full set of genes and functions that were positively selected during the evolution of each species, and which show evidence of relaxation?
How do these sets of genes compare among themselves and between derived and ancestral lineages at a functional level?
39
Comparing Human-Chimp PSGs studies
Clark, et al. Science 2003 (Celera)
  Number of PSGs inferred: 7,645: 1,547-H, 1,534-C
  Main GO-PSGs categories (Biological process): Olfaction, Sensory perception, G-PCR
  H-C differential lineage analysis: YES
  Differences PS-RX: NO
  Multiple testing correction: NO
  ML models instead of Ka/Ks ratio: YES
  GO-PSGs differences between H-C: YES
Nielsen, et al. PLoS Biol 2005 (Celera)
  Number of PSGs inferred: 8,079: 733 (H-C), 35 p<0.05
  Main GO-PSGs categories (Biological process): Immune response, Sensory perception, Spermatogenesis, Apoptosis, Cell cycle
  H-C differential lineage analysis: NO
  Differences PS-RX: NO
  Multiple testing correction: YES
  ML models instead of Ka/Ks ratio: YES
  GO-PSGs differences between H-C: ---
The Chimp Seq. Analysis Consortium, Nature 2005
  Number of PSGs inferred: 13,454: 585 (H-C)
  Main GO-PSGs categories (Biological process): Spermatogenesis, Perception of sound, Reproduction, Olfaction, Immune response
  H-C differential lineage analysis: YES
  Differences PS-RX: YES
  Multiple testing correction: YES
  ML models instead of Ka/Ks ratio: NO
  GO-PSGs differences between H-C: YES-NO
Bustamante, et al. Nature 2005 (Celera)
  Number of PSGs inferred: 10,767: 304-H
  Main GO-PSGs categories (Biological process): Apoptosis, Gametogenesis, Immune response, Sensory perception, mRNA transcription, Transcription factor
  H-C differential lineage analysis: NO
  Differences PS-RX: NO
  Multiple testing correction: ---
  ML models instead of Ka/Ks ratio: MK
  GO-PSGs differences between H-C: ---
Arbiza, et al. PLoS CB 2006 (Ensembl)
  Number of PSGs inferred: 13,198: 108-H, 577-C
  Main GO-PSGs categories (Biological process): Transcription factor, Cell. Prot. Metab., Olfaction, Immune response, G-PCR
  H-C differential lineage analysis: YES
  Differences PS-RX: YES
  Multiple testing correction: YES
  ML models instead of Ka/Ks ratio: YES
  GO-PSGs differences between H-C: NO
40
Positive selection, relaxation and clock
Derived and ancestral lineages
Testing PS and RX (CodeML, PAML; Zhang, Nielsen, Yang, Mol Biol Evol 2005)
Branch-site models: site classes (0, 1, 2) on foreground and background branches
Positive selection: $\omega_2 > 1$ on the foreground branch (proportions $p_{2a}$, $p_{2b}$)
Neutrality: $\omega_1 = 1$ (proportion $p_1$)
Purifying selection: $0 < \omega_0 < 1$ (proportion $p_0$)
Test I: M1a (neutral) vs A (Positive selection) models
Test II: A1 (neutral) vs A (Positive selection) models
Testing Clock (RRTree; Robinson-Rechavi, Bioinformatics 2000)
41
Human and Chimp evolve at equal rates
42
Gene functions outside clock behaviour are not significantly different between H and Ch
43
PSGs deduced from Ka/Ks > 1 vs. ML branch-site model approximations
A minor proportion of genes with Ka/Ks > 1 match events of PS
(Figure legend: Ka/Ks > 1 with PS; Ka/Ks < 1 with PS; Ka/Ks > 1 without PS; Ka/Ks < 1 without PS)
44
Genes
Functional Analysis
Relative Ancestral and Derived change
45
46
Conclusions I
Non-neutral evolution is an infrequent process shaping the pattern of divergence between human and chimp genomes.
665 Human (5%) and 1,341 Chimp (10%) genes after correcting for multiple testing
Use of mean normalized Ka rate approaches (Ka/Ks) for identifying cases of positive selection should be discarded in favor of more sensitive methodologies.
Functional classes encompassed by the sets of genes evolving without clock and those under the influence of positive selection in both species were found to be largely the same and in similar proportions.
However, the sets of PSGs were different in each species.
Comparisons of relative trends among derived and ancestral lineages may provide further insight into H and Ch differences
Out of 59 GO categories:
41 showed a relative increase in Human greater than in Chimp
11 showed a relative increase in Chimp greater than that in Human
This suggests that Human may have grown further apart from the ancestral lineage in common GO categories than Chimp has.
47
Positive selection: Nervous System
48
Brain genes are among the most conserved genes analysed in PS studies
49
Main Questions
How different is the evolution of human brain-specific genes from:
• other human TSGs?
• and between lineages?
Are there any effects of using alternative statistical methods?
• Rate estimation?
• Mean estimation?
286 H-TSGs, 551 HKGs, 9 Tissues
SEWM estimation using a free-branch ML model
50
Re-Analysis of Dorus, et al. (2004) data
SEWM estimation using a free-branch ML model
51
Some of the H-TSGs under PS (Test II)
52
Conclusions II
Estimates of average dN/dS ratios are sensitive to the methods used for combining estimates and to the methods used for estimating dN/dS.
Brain-specific genes show no evidence of acceleration in the primate lineage compared with many other tissue-specific gene categories.
There is no evidence of an elevated dN/dS ratio in brain-specific genes in humans compared with chimpanzees, nor between primate and murid taxa.
The number of brain-specific genes showing evidence of positive selection is higher in the chimpanzee than in the human evolutionary lineage.
While there undoubtedly has been much positive selection relating to brain function during the evolution of modern man, this selection has not been so pervasive that it has resulted in a detectably accelerated rate of molecular evolution.
53
However…
TIG, November 2006
Human Disease & Evolution
55
Medicine believes in genetics, less in bioinformatics and almost nothing in evolutionary biology!
Medicine seldom takes into account evolutionary biology's conclusions
Parasites should evolve towards a benign coexistence with their host…
It strongly depends on the mode of inheritance (Paul Ewald, 1996)
Scientists working in biomedicine rarely recognize basic evolutionary biology concepts
PAML matrix
Positional homology
56
Evolutionary Thinking in Biomedicine
Notable exceptions
1996. Paul M. Ewald. Evolution of infectious disease
"Parasites vertically inherited should evolve toward a benign coexistence with their host".
1996. R. Nesse and G. Williams. Why we get sick? The new science of Darwinian Medicine
"Why, in a body of such exquisite design, are there a thousand flaws and frailties that make us vulnerable to disease?... They suggest new ways of addressing illness"
1998. Stephen C. Stearns. Evolution in health and disease
"... our body was shaped by natural selection to maximize reproductive success in ancestral environments".
2002. Steve A. Frank. Immunology and Evolution of Infectious Disease
Bridges the gap between immunology and epidemiology.
57
Example: SNPs and Disease
SNPs can cause alterations of gene function by…
Alterations in expression level
Alternative splicing
Alteration (or loss) of gene product function
Changes in the stability of the protein
Functionally important residues
Phylogenetic conservation
Natural selection working at codon level
58
Main Question
Could an estimator of the selective pressures acting at codon level ($\omega$) be used as a predictor of the phenotypic effect of SNPs?
59
Detecting Positive & Negative Selection
Site-specific models average dN/dS over lineages but differentiate over sites
(Figure: dN/dS plotted per site.)
ML models for positive selection: M1a vs M2a; M7 vs M8
Bayes Empirical Bayes (BEB): M2a, M8 (which class each site h is most likely to belong to).
Sitewise likelihood-ratio method (SLR)
60
Bayes Empirical Bayes (BEB)
(Figure legend: Neutral, Positive, Purifying)
Evolutionary models in action
(Figure labels: p53, DNA)
A real case with the master protein of the cell
62
TP53 protein
Probably p53 is the main protein regulating cell division and apoptosis
Many mutant forms are involved in different types of human cancer
63
Evolutionary Models in TP53
• M1a vs M2a -> no PS
• M7 vs M8 -> no PS
• logL(M2a) > logL(M8) -> **best
• SLR -> Other alternative method
• (multiple testing 95%, 99%)
• ---, ----
• +++, ++++
64
TP53 Evolutionary Analysis
DB and TR domains have the lowest $\omega$ value distribution
65
DNA-binding domain evolutionary and biological analysis
SLR $\omega$ classes: $\omega \le 0.1$; $0.1 < \omega \le 0.2$; $0.2 < \omega \le 0.3$; $\omega > 0.3$
(Figure, panel A: the DNA-binding domain bound to DNA, residues 100 to 280, with secondary-structure elements L1, S1, S2, S2', S3, S4, L2, S5, S6, S7, H1, S8, L3, S9, S10 and H2, and a per-site SLR symbol above each residue.)
(Figure, panel B: the domain sequence with per-site marks.)
96–SVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQH-168
169–MTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSS-241
242–CMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENL-289
66
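TR-domain evolutionary and biological analysis
SLR $\omega$ classes: $\omega \le 0.1$; $0.1 < \omega \le 0.2$; $0.2 < \omega \le 0.3$; $\omega > 0.3$
(Figure, panels A and B: the TR domain, residues 325 to 355, with a per-site SLR symbol above each residue.)
325–GEYFTLQIRGRERFEMFRELNEALELKDAQA-355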
67
How effective is natural selection ?
• Evolutionary biologists recognized that natural selection works in proportion to the number of deleterious mutations in the population
68
TP53 mutation freq. and selective constraints
According to the theory this will follow an "L"-shaped curve
The Main Question again
p53 results seem to show good signals; however, is it possible to obtain a specific predictor of the more frequent amino acid changes associated with human diseases?
70
Bioinformatics and evolutionary analysis
Analyse DBs containing codon mutation frequencies for all the possible human disease proteins
Immune deficiency and cancer (COSMIC) databases (approx. 250 genes)
Ensembl orthologous genes in different species: Mammals and Vertebrates
Evolutionary ML analysis (M1a, M2a, M7, M8, SLR)
Statistical tests (KS); reject genes with <10 mutations
71
$\omega$ and frequency distribution of Immune and Cancer mutations
Threshold: $\omega = 0.1$
Two-sample Kolmogorov-Smirnov test
H0: freq. (lower $\omega$) = freq. (upper $\omega$)
HA: freq. (lower $\omega$) > freq. (upper $\omega$)
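A one-sided two-sample Kolmogorov-Smirnov comparison of this kind can be sketched without external libraries. The code below is an illustration only: the function names are made up, the mutation counts are invented, and the p-value uses the asymptotic approximation $e^{-2D^2 nm/(n+m)}$, which is rough for samples this small. Nothing here reproduces the analysis behind the slide.

```python
import math

def ecdf(sample, x):
    """Empirical cumulative distribution function of sample evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)

def ks_one_sided(larger, smaller):
    """One-sided two-sample KS test.

    Alternative hypothesis: values in `larger` tend to exceed those in `smaller`,
    i.e. the ECDF of `larger` lies below the ECDF of `smaller`.
    Returns the statistic D and an asymptotic p-value.
    """
    points = sorted(set(larger) | set(smaller))
    d = max(ecdf(smaller, x) - ecdf(larger, x) for x in points)
    n, m = len(larger), len(smaller)
    p_approx = math.exp(-2.0 * d * d * n * m / (n + m))
    return d, p_approx

# Invented mutation counts per codon: codons with omega below vs above the threshold
low_omega_codons = [12, 9, 15, 7, 20, 11, 8]
high_omega_codons = [1, 0, 3, 2, 5, 1, 4]
d, p_approx = ks_one_sided(low_omega_codons, high_omega_codons)
print(d, p_approx)
```

A small p-value would support HA, that mutations are more frequent at codons under strong selective constraint (low $\omega$).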
72
Conclusions III
We have found an evolutionary parameter that allows us to differentiate amino acids where disease is more frequent
This parameter is a measure of the action of natural selection working on vertebrate species over millions of years
We hypothesize that nonsynonymous changes at amino acids showing $\omega < 0.1$ probably affect the normal function of proteins
Recently we confirmed these results using more than 3,000 proteins
Disease and polymorphisms are differentiated using $\omega$ values
73
Selective constraints on all the cSNPs of the Human Genome
PAML-SLR
Evolutionary Models
74
Bioinformatic Tool: SNPs probably associated with Mendelian diseases (NAR, web issue 2006)
76
Thanks… again!!
Micostrium Vulgaris
Phylum: Chordata
Subphylum: Vertebrata
Class: Mammalia