Page 1: Alexis Dereeper

Alexis Dereeper

Homology analysis and molecular phylogeny

CIBA courses – Brasil 2011

Page 2: Alexis Dereeper

4 steps for a phylogenetic analysis

• Data selection
• Sequence alignment
• Method selection, then calculation or estimation of the tree that best fits the data
  o Distance methods: calculate distance
  o Parsimony
  o Probabilistic methods: maximum likelihood, Bayesian
  o Model? Optimization
• Test the reliability of the obtained tree

Page 3: Alexis Dereeper

Phylogeny.fr

“The Phylogeny.fr platform transparently chains programs to automatically perform phylogenetic analysis tasks”

Page 4: Alexis Dereeper

Homology analysis

What is sequence homology?

• Not a quantitative concept, unlike similarity or identity (e.g. 28% identity): genes are either homologous or not

• Homologs: genes descended from a common ancestor
• Paralogs: homologs arising from a duplication event
• Orthologs: homologs arising from a speciation event

• Homology and function: homology does not systematically imply the same function. The closest orthologs may have the same function, but more distant orthologs rarely show the same phenotypic role (although they may keep the same role in a specific metabolic pathway). Paralogs, on the other hand, rapidly acquire different functions.

Page 5: Alexis Dereeper

How are homologous sequences similar?

• From 100% identity down to only a few nt/aa in common

• No rule, no limit. The estimate is based on the probability that 2 sequences are similar by chance (e-value):

DNA: e-value < 10^-6 and identity > 70%
Protein: e-value < 10^-3 and identity > 25%

• Sequences without noticeable resemblance can still be homologous (similarity found at the 3D structure level).

• Conversely, a strong resemblance is generally interpreted as homology, not as convergent evolution
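The thresholds above can be turned into a small filter. The sketch below is only an illustration: the function names (`percent_identity`, `looks_homologous`) are hypothetical, and percent identity is computed here over ungapped columns only, which is one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Thresholds quoted on the slide: (maximum e-value, minimum % identity).
THRESHOLDS = {"dna": (1e-6, 70.0), "protein": (1e-3, 25.0)}


def looks_homologous(evalue: float, identity: float, kind: str) -> bool:
    """Apply the slide's rule of thumb; kind is 'dna' or 'protein'."""
    max_evalue, min_identity = THRESHOLDS[kind]
    return evalue < max_evalue and identity > min_identity


if __name__ == "__main__":
    identity = percent_identity("ACGT-A", "ACGA-A")
    print(identity, looks_homologous(1e-10, identity, "dna"))
```

As the slide stresses, a hit that fails this filter can still be homologous, so the result is a screening aid and not a definition.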

Page 6: Alexis Dereeper

How to detect homology?

By sequence comparison = sequence alignment

1- Local alignment (e.g. Blast)
Designed to search for similar regions
Alignment of a particular sequence against a database of sequences
(Smith & Waterman)

2- Global alignment (e.g. ClustalW)
Designed to compare homologous sequences over their full length
(Needleman & Wunsch)
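The difference between the two strategies shows up directly in the dynamic programming recurrence. The following is a minimal sketch that returns only the best score, with an assumed scoring scheme (match +1, mismatch -1, linear gap -2); Blast and ClustalW use heuristics and richer scoring, so this is not how those programs are implemented.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best pairwise alignment score by dynamic programming.

    local=False: Needleman-Wunsch (global, end gaps are penalised).
    local=True:  Smith-Waterman (local, cell scores are floored at 0).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps cost something.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "TTACGTTT", local=True))
    print(alignment_score("ACGT", "TTACGTTT", local=False))
```

With these parameters the local score rewards the shared `ACGT` region, while the global score is pulled down by the gaps needed to span the full length of the longer sequence.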

Page 7: Alexis Dereeper

Classical Blast output

Different Blast programs:

● BlastN (Query: DNA / Subject: DNA)
● BlastP (Query: protein / Subject: protein)
● BlastX (Query: DNA / Subject: protein)
● TBlastN (Query: protein / Subject: DNA)
● TBlastX (Query: translated DNA / Subject: translated DNA)

Score
E-value: indicates the significance of the score
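To make the link between score and e-value concrete: in Blast statistics the expected number of chance hits is E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below applies that formula directly; the function name is hypothetical, and Blast itself uses effective (edge-corrected) lengths, so reported e-values will differ somewhat from this approximation.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate e-value: E = m * n * 2^(-S'), using raw lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # The same bit score is less significant against a larger database.
    for db_len in (10**6, 10**9):
        print(db_len, expected_chance_hits(50.0, 300, db_len))
```

This is why an e-value always has to be read together with the size of the database that was searched.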

Page 8: Alexis Dereeper

Blast Explorer

• Enables an assisted selection of homologous sequences using various criteria

• Post-processing of Blast results:
  o Guide tree (similarity tree), with possible selection on branches and leaves
  o Score / e-value distribution
  o Taxonomic tree of hits

Page 9: Alexis Dereeper

BBMH method (Best Blast Mutual Hits) or RBH (Reciprocal Best Hit)

[Figure labels: Proteome Species 1, Proteome Species 2]

Ortholog databases:

● Inparanoid (eukaryotes)
● HomoloGene (eukaryotes)
● OrthoMCL DB
● COG (Clusters of Orthologous Groups of proteins) (prokaryotes and eukaryotes)
● GreenPhyl (plants)
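The reciprocal best hit idea can be sketched in a few lines. This is an illustration under stated assumptions: Blast hits are given as hypothetical `(query, subject, bit_score)` tuples from the two searches (proteome 1 against proteome 2 and the reverse), and the best hit is simply the one with the highest bit score.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, bit_score in hits:
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits_1_vs_2, hits_2_vs_1):
    """Return sorted (gene1, gene2) pairs that are each other's best hit."""
    best_1_to_2 = best_hits(hits_1_vs_2)
    best_2_to_1 = best_hits(hits_2_vs_1)
    return sorted(
        (g1, g2) for g1, g2 in best_1_to_2.items() if best_2_to_1.get(g2) == g1
    )


if __name__ == "__main__":
    # Made-up identifiers and scores, for illustration only.
    forward = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90), ("a3", "b1", 80)]
    reverse = [("b1", "a1", 195), ("b2", "a1", 140), ("b2", "a2", 100)]
    print(reciprocal_best_hits(forward, reverse))
```

Here only `a1` and `b1` point at each other, so they form the single candidate ortholog pair; `a2` and `a3` have a best hit that does not point back.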

Page 10: Alexis Dereeper

Phylogenetic analysis

Step 1: Multiple alignment (global alignment)

• Alignment software:
  o ClustalW
  o Muscle
  o Tcoffee
  o 3DCoffee (optimizes the alignment using 3D structure)
  o Mafft

[Figure: speed scale, from fast to slow]

• Alignment formats: Fasta, Clustal, Phylip, Nexus

• Alignment visualization/editing software:
  o SeaView
  o Jalview
  o BioEdit
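Moving between alignment formats is a frequent practical step. As a rough sketch, the code below reads an aligned Fasta text and writes sequential Phylip with names padded or truncated to 10 characters (the strict Phylip convention). The function names are hypothetical, and relaxed or interleaved Phylip variants are not handled.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from Fasta-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Write aligned records as sequential Phylip text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name[:10]:<10}{seq}")  # name field is exactly 10 characters
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    fasta = ">seqA first sequence\nAC-GT\nAA\n>seqB\nACGGTAA\n"
    print(to_phylip(parse_fasta(fasta)))
```

The length check matters because Phylip states the alignment width in its header line, so unaligned input cannot be represented.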

Page 11: Alexis Dereeper

Step 2: Alignment cleaning

• Removal of divergent regions that carry a low phylogenetic signal (not very informative). These regions may not be homologous, or may have been saturated by substitutions (e.g. synonymous sites in coding regions)

=> The cleaned alignment is more suitable for a phylogenetic analysis

• Alignment curation software: GBlocks
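To give a feel for what alignment cleaning does, here is a deliberately simplified stand-in: it drops columns whose gap fraction exceeds a threshold. This is not the GBlocks algorithm, which selects conserved blocks using several criteria; the function name and the default threshold are assumptions made for the example.

```python
def drop_gappy_columns(sequences, max_gap_fraction=0.5):
    """Remove alignment columns where more than max_gap_fraction of rows are gaps."""
    if not sequences:
        return []
    n_cols = len(sequences[0])
    if any(len(seq) != n_cols for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col
        for col in range(n_cols)
        if sum(seq[col] == "-" for seq in sequences) / len(sequences) <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in sequences]


if __name__ == "__main__":
    alignment = ["AC-GT", "A--GT", "ACTG-"]
    print(drop_gappy_columns(alignment))
```

In the example the third column is removed because two of the three sequences have a gap there, while columns with a single gap are kept.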

Page 12: Alexis Dereeper

Step 3: Phylogenetic reconstruction

Step 3a: Choose a method for phylogenetic reconstruction

• 4 main methods/algorithms:
Distance methods, pairwise (UPGMA, Neighbor Joining)
  o FastDist, BIONJ, Neighbor
Maximum parsimony
  o DNAPars, TNT
Maximum likelihood
  o PhyML, PAML
Bayesian inference
  o MrBayes, Beast

• Output formats: distance matrix, Newick format

Choose the right compromise between speed and performance
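As an illustration of a distance method and of the two output formats named above, the sketch below runs UPGMA on a small distance matrix and emits a Newick string. It is a teaching-sized version under stated assumptions: distances are passed as a hypothetical dict keyed by name pairs, ties are broken by name order, and the programs listed on the slide are considerably more elaborate.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree and return it in Newick format.

    distances maps (name_a, name_b) tuples to pairwise distances.
    """
    # Each cluster: key -> (newick text, number of leaves, height of its root).
    clusters = {name: (name, 1, 0.0) for name in names}
    dist = {frozenset(pair): d for pair, d in distances.items()}
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: dist[frozenset(p)])
        (nwk_a, size_a, h_a), (nwk_b, size_b, h_b) = clusters.pop(a), clusters.pop(b)
        height = dist[frozenset((a, b))] / 2.0
        merged = a + "+" + b
        newick = f"({nwk_a}:{height - h_a:.3f},{nwk_b}:{height - h_b:.3f})"
        # New distances are size-weighted averages of the old ones.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = (newick, size_a + size_b, height)
    (newick, _, _), = clusters.values()
    return newick + ";"


if __name__ == "__main__":
    d = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}
    print(upgma(["A", "B", "C"], d))
```

UPGMA assumes a constant rate of evolution across lineages, which is one reason Neighbor Joining is often preferred among distance methods.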

Page 13: Alexis Dereeper

Step 3b: Choose parameters and evolution models

• Different evolution models specify the substitution rates for aa or nt:
DNA
  o Jukes-Cantor, Kimura, F81, HKY85, GTR
Protein
  o JTT, WAG, Dayhoff

• Model testing software: tests and selects the substitution model (and parameters) best suited to the dataset (the one with the maximum likelihood)
  o ProtTest, ModelTest (based on PhyML)
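The simplest DNA model in the list, Jukes-Cantor, shows what a substitution model does in practice: it corrects the observed proportion of differing sites p for multiple substitutions at the same site, d = -(3/4) ln(1 - 4p/3). The sketch below is a minimal version with hypothetical function names; gapped columns are ignored as an assumed convention.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites over ungapped columns of two aligned sequences."""
    pairs = [
        (x, y) for x, y in zip(seq_a.upper(), seq_b.upper()) if x != "-" and y != "-"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance in substitutions per site."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); beyond that the sequences are saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGTACGT", "ACGTACGA")
    print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, and the gap widens as sequences diverge, which is the saturation effect that richer models such as HKY85 or GTR handle with more parameters.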

Page 14: Alexis Dereeper

Step 3c: Estimate the branch robustness

• Bootstrap procedure

1- Resample the alignment by columns: create a pseudo-alignment by drawing sites at random (with replacement), then recompute the tree.
2- Repeat the process N times.
3- For each branch of the initial tree, count the number of bootstrap trees in which it appears. The higher this number, the better supported the branch.

• aLRT test (approximate Likelihood Ratio Test) (Anisimova & Gascuel, Syst Biol, 2006)
  o Integrated in PhyML
  o Much faster (PhyML is run only once)
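The bootstrap steps above can be sketched as two small functions: one that resamples alignment columns with replacement, and one that counts how often each branch of the initial tree reappears. Tree building itself is left out; branches are represented here, as an assumption for the example, by `frozenset`s of the leaf names on one side of the branch, and all names are hypothetical.

```python
import random


def bootstrap_alignment(sequences, rng):
    """Step 1: draw columns at random, with replacement, to build a pseudo-alignment."""
    n_cols = len(sequences[0])
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[col] for col in columns) for seq in sequences]


def bootstrap_support(initial_branches, replicate_branch_sets):
    """Step 3: percentage of bootstrap trees that contain each initial branch."""
    n_replicates = len(replicate_branch_sets)
    return {
        branch: 100.0 * sum(branch in rep for rep in replicate_branch_sets) / n_replicates
        for branch in initial_branches
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print(bootstrap_alignment(["ACGT", "AGGT"], rng))

    ab = frozenset({"A", "B"})
    replicates = [{ab}, {frozenset({"A", "C"})}, {ab}, {ab}]
    print(bootstrap_support([ab], replicates))
```

In a real analysis each pseudo-alignment would be passed to the same tree-building method as the original data (step 2), which is why the bootstrap is so much slower than aLRT.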

Page 15: Alexis Dereeper

Step 4: Visualization and editing of the phylogenetic tree

• Graphical tools available to display trees from Newick format:
  o TreeDyn
  o DrawGram, DrawTree
  o ATV
  o NJPlot

• Graphical output formats: PNG, SVG, PDF…

Step 5: Interpretation of the tree
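Since every tool above starts from a Newick string, it can help to see how little structure the format contains. The following is a minimal sketch of a parser for simple, well-formed Newick (no quoted labels, comments or whitespace handling); the function names and the nested `(label, branch_length, children)` tuple layout are choices made for this example.

```python
def parse_newick(text):
    """Parse simple Newick text into nested (label, branch_length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            pos += 1  # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        label = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1  # skip ":"
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (label, length, children)

    return node()


def leaves(tree):
    """List leaf labels from left to right."""
    label, _, children = tree
    if not children:
        return [label]
    return [name for child in children for name in leaves(child)]


if __name__ == "__main__":
    tree = parse_newick("((A:1.0,B:1.0):2.0,C:3.0);")
    print(leaves(tree))
```

A drawing program does essentially this before laying out the branches; the graphical tools differ mainly in layout options and output formats.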