Bioinformatics- Assignment 1 Report
Phylogeny Construction for Influenza viruses based on hemagglutinin sequence
Reihaneh Rabbany k.
Contents
Assignment Description
First Step, Collecting Protein Sequences
  Influenza Virus
  Collecting Sequences
  FASTA Format
Step 2, Computing Pairwise Distances and Multiple Alignment
  Scoring scheme
  ClustalW
Step 3, Phylogeny Construction
  Distance-based phylogeny
  Character-based phylogeny
Step 4, Evaluation
  Consistency
  Bootstrapping
References
Figures
Figure 1- annotated phylogeny tree by distance method
Figure 2- annotated phylogeny tree obtained by parsimony
Figure 3- Zoomed branch of distance tree (left) and parsimony tree (right)
Figure 4- Comparing resulted cladogram from distance method (right) with the reported cladogram for Influenza A virus by Yoshiyuki Suzuki, et al. (left)
Figure 5- Bootstrap values: the upper one corresponds to the parsimony method and the bottom one to the distance method
Assignment Description
The following is an H1N1 influenza virus hemagglutinin protein sequence, in FASTA format.
>gi|89903075|gb|ABD79112| /Human/4(HA)/H1N1//1946/// hemagglutinin [Influenza A virus A/Cam/46(H1N1))]
MKAKLLILLCALSATDADTICIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDSHNGKLCRLKGIAPLQLG
KCNIAGWILGNPECESLLSKRSWSYIAETPNSENGACYPGDFADYEELREQLSSVSSFERFEIFPKGRSW
PEHNIDIGVTAACSHAGKSSFYKNLLWLTEKDGSYPNLNKSYVNKKEKEVLILWGVHHPPNIENQKTLYR
KENAYVSVVSSNYNRRFTPEIAERPKVRGQAGRINYYWTLLEPGDTIIFEANGNLIAPWYAFALNRGIGS
GIITSNASMDECDTKCQTPQGAINSSLPFQNIHPFTIGECPKYVRSTKLRMVTGLRNIPSIQSRGLFGAI
AGFIEGGWDGMIDGWYGYHHQNEQGSGYAADQKSTQNAINGITNKVNSVIEKMNTQFTAVGKEFNKLEKR
MENLNKKVDDGFLDIWTYNAELLVLLENERTLDFHDSNVKNLYEKVKNQLRNNAKEIGNGCFEFYHKCNN
ECMESVKNGTYDYPKFSEESKLNREKIDGVKLESMGVYQILAIYSTVASSLVLLVSLGAISFWMCSNGSL
QCRICI
Detailed tasks:
1. Search for at least 100 other hemagglutinin protein sequences for influenza viruses, such that
they are distributed well in all 16 subtypes (H1–H16).
2. Use an appropriate scoring scheme to compute the pairwise distances between every pair of
the above sequences; using the same scoring scheme, construct a multiple sequence
alignment for these sequences.
3. Use a distance-based and a character-based phylogeny construction method, together with an
out-group, to build two phylogenies for these sequences.
4. Evaluate the constructed phylogenies.
Note that the detailed descriptions of steps of operations you perform and the consequences of these
operations must be reported (for example, the number of sequences you collected from each database,
each tool you have called and their availability).
First Step, Collecting Protein Sequences
In the first step, I search for at least 100 other hemagglutinin protein sequences for influenza
viruses, such that they are well distributed across all 16 subtypes (H1–H16). To do this, I first need to
become familiar with the influenza virus.
Influenza Virus
“The influenza virus is an RNA virus comprises five genera: Influenzavirus A, Influenzavirus B,
Influenzavirus C, Isavirus, and Thogotovirus. The type A viruses are the most virulent human pathogens
and cause the most severe disease. The Influenza A genome encodes 11 proteins: hemagglutinin (HA),
neuraminidase (NA), nucleoprotein (NP), M1, M2, NS1, NS2(NEP), PA, PB1, PB1-F2 and PB2” [1].
“HA and NA are large glycoproteins on the outside of the viral particles; these proteins are targets for
antiviral drugs which are antigens to which antibodies can be raised. Influenza A viruses are classified
into subtypes based on antibody responses to HA and NA, forming the basis of the H and N distinctions
in, for example, H5N1” [1]. “There are 16 different HA antigens (H1 to H16) and nine different NA
antigens (N1 to N9) for influenza A” [2].
Naming
Each virus subtype has mutated into a variety of strains [2]. (A strain is a genetic variant or subtype
of a microorganism such as a virus; for example, a "flu strain" is a certain biological form of the
influenza or "flu" virus.) Generally, influenza A variants are identified according to the isolate that
they resemble and with which they are presumed to share lineage (for example, Fujian flu virus-like);
according to their typical host (for example, bird flu, human flu, swine flu, horse flu, dog flu);
according to their subtype, given as an H number (for hemagglutinin) and an N number (for
neuraminidase) (for example, H3N2); and according to their deadliness (for example, LP, low
pathogenic) [2,3].
Collecting Sequences
I used the Influenza Virus Resource at NCBI (footnote 2) to retrieve the HA protein sequences for
influenza. It contains more than 11000 viruses. I requested all complete HA sequences of Influenza A
from the database, chose 7 of each subtype, and downloaded them in FASTA format (the selected
sequences are mostly from the USA and from between 2000 and 2008, unless there were not enough such
sequences in those years).
Despite the large number of influenza sequences in NCBI, it contains only 2 H14 and 5 H15 sequences.
Therefore, I used UniProt (footnote 3) to find more sequences of these subtypes, and found 4 H14 and
7 H15 sequences there. I also searched BioHealthBase (footnote 4) and found 8 H15 and 2 H14 sequences there.
All these results are integrated and stored in “data/sequences.fasta”.
FASTA Format
FASTA is a text-based format for representing nucleotide or peptide sequences; here it holds peptide
sequences, in which amino acids are represented using single-letter codes.
The description line begins with the ">" symbol. The word following the ">" symbol is the identifier of
the sequence, and the rest of the line is the description (both are optional). There should be no space
between the ">" and the first letter of the identifier. In this case the descriptions contain each virus's
location, host, year and subtype.
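The format rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the tooling used for this report: the function names `parse_fasta` and `count_subtypes` are hypothetical, and the subtype is assumed to appear in the description line as a token such as "H1N1" (as it does in the sequence given in the assignment description).

```python
from collections import Counter
import re


def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def count_subtypes(records):
    """Count records per HA subtype (assumes a token like 'H5N1' or 'H5' in the description)."""
    counts = Counter()
    for header, _ in records:
        match = re.search(r"\bH(\d{1,2})(?=N\d|\b)", header)
        counts["H" + match.group(1) if match else "unknown"] += 1
    return counts
```

Running `count_subtypes(parse_fasta(open("data/sequences.fasta").read()))` would then give a quick check of how evenly the collected sequences cover H1 to H16.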
Amino acid codes
The amino acid codes supported are:
Code  Meaning
A     Alanine
B     Aspartic acid or Asparagine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
O     Pyrrolysine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
U     Selenocysteine
V     Valine
W     Tryptophan
Y     Tyrosine
Z     Glutamic acid or Glutamine
X     Any
*     Translation stop
-     Gap of indeterminate length
2 http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi?go=1
3 http://www.uniprot.org/
4 http://www.biohealthbase.org/GSearch/home.do?decorator=Influenza
Step 2, Computing Pairwise Distances and Multiple Alignment
Here I use an appropriate scoring scheme to compute the pairwise distances between every pair
of the sequences collected above, and then, using the same scoring scheme, construct a multiple
sequence alignment for these sequences.
Scoring scheme
A scoring scheme encodes the biological information that determines how an alignment is computed.
It includes a substitution matrix (which assigns scores to amino-acid matches and mismatches) and
gap penalties (for aligning an amino acid in one sequence to a gap in the other) [7, 8].
The two common families of substitution matrices are the PAM series and the BLOSUM series. When
comparing closely related proteins, one should use a lower-numbered PAM or a higher-numbered BLOSUM
matrix; for distantly related proteins, a higher-numbered PAM or a lower-numbered BLOSUM matrix [7].
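To make the role of the two ingredients concrete, here is a minimal sketch of a global (Needleman-Wunsch) alignment score in Python. It is an illustration under simplifying assumptions, not what ClustalW does internally: the substitution function `toy_substitution` is a made-up match/mismatch rule standing in for a real PAM or BLOSUM matrix, and the gap penalty is linear rather than split into opening and extension costs.

```python
def global_alignment_score(a, b, substitution, gap_penalty):
    """Needleman-Wunsch score with a linear gap penalty (score only, no traceback)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j].
    previous = [-gap_penalty * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [-gap_penalty * i]
        for j in range(1, len(b) + 1):
            current.append(max(
                previous[j - 1] + substitution(a[i - 1], b[j - 1]),  # match or mismatch
                previous[j] - gap_penalty,                            # gap in b
                current[j - 1] - gap_penalty,                         # gap in a
            ))
        previous = current
    return previous[-1]


def toy_substitution(x, y):
    """Hypothetical stand-in for a substitution matrix: +2 for a match, -1 otherwise."""
    return 2 if x == y else -1
```

Swapping `toy_substitution` for a lookup into a real matrix changes which residue replacements are considered cheap, which is exactly the choice between the PAM and BLOSUM series discussed above.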
ClustalW
For performing multiple sequence alignment, I used ClustalW via the Mobyle (footnote 5) web service
(a portal for bioinformatics analyses). I also tried ClustalX, which is a graphical interface for the
ClustalW multiple sequence alignment program, but as there is no difference in functionality, I
kept using the web service.
“ClustalW is a progressive method that generates a multiple sequence alignment by first aligning the
most similar sequences and then adding successively less related sequences or groups to the alignment
until the entire query set has been incorporated into the solution. The initial tree describing the
sequence relatedness is based on pairwise comparisons.” [8]
For its scoring scheme, I selected the following settings for both the pairwise alignment parameters
and the protein parameters of the multiple sequence alignment (footnote 6):
Gap opening penalty: 10
Gap extension penalty: 0.2
Gap separation penalty range: 8
Delay divergent sequences: 30% identity for delay
Protein weight matrix: PAM series
Protein weight matrix for pairwise alignment: PAM350
5 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=clustalw-multialign
6 The command for rerunning it is:
clustalw -align -infile=sequences.fasta -type=protein -matrix=blosum -nopgap -nohgap -hgapresidues="RNDQEGKPS" -pwmatrix=blosum -outfile=BlosumAligned
All the results of this section are stored under the directory “analysis\*\clustalw-multialign”.
It includes a “sequences.aln” file that contains the multiple sequence alignment, a
“sequences.dnd” file that contains the resulting guide tree, and “clustalw-multialign.out”, which logs
the progress of the algorithm, including the pairwise scores between each pair of sequences.
Step 3, Phylogeny Construction
In this step I use a distance-based and a character-based phylogeny construction method,
together with an out-group (the species chosen so that the root of the tree lies on the branch
leading to it), to build two phylogenies for these sequences.
Distance-based phylogeny
In distance-based tree reconstruction, we reconstruct an evolutionary tree from a distance matrix. As
most distance measures are not guaranteed to produce an additive matrix, we usually use a hierarchical
clustering method for building the tree, such as UPGMA (Unweighted Pair Group Method with
Arithmetic Mean), which produces an ultrametric rooted tree (the root corresponds to the cluster created
last), or NJ (Neighbor Joining) [9].
For constructing this phylogenetic tree I used “Protdist” (Protein Sequence Distance Method) from the PHYLIP
toolbox via the Mobyle web service (footnote 7). This program computed a distance matrix (based on the given
multiple sequence alignment), which I then fed into the NJ algorithm (again via the Mobyle web service, footnote 8) to
obtain the corresponding phylogenetic tree. The distance matrix produced by Protdist is stored
in “analysis\*\Distance\protdist\protdist.outfile” and the tree produced by NJ is stored under
“analysis\*\Distance\neighbor\neighbor.outtree”. I plotted its cladogram and tree using the “drawgram”
and “drawtree” programs in PHYLIP. The results are “analysis\*\Distance\cladogram.pdf” and
“analysis\*\Distance\tree.pdf”.
Figure 1- annotated phylogeny tree by distance method
7 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=protdist
With command: ln -s PAMali.phylipi infile && protdist <protdist.params && mv outfile protdist.outfile
8 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=neighbor
With command: ln -s protdist.outfile infile && neighbor <neighbor.params && mv outfile neighbor.outfile && mv outtree neighbor.outtree
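As an illustration of the hierarchical clustering idea behind distance-based reconstruction, here is a minimal UPGMA sketch in Python. Note the assumptions: this is UPGMA, not the Neighbor Joining procedure of PHYLIP's "neighbor" program used for the trees in this report, and the function name `upgma` and the nested-tuple tree representation are choices made only for this example.

```python
def upgma(names, dist):
    """Build a rooted tree by UPGMA.

    names: list of leaf labels; dist: symmetric distance matrix (list of lists).
    Returns (tree, height): tree is a nested tuple of labels, height is the
    distance from the root to every leaf.
    """
    # Active clusters: id -> (subtree, number of leaves, height of the cluster).
    clusters = {i: (names[i], 1, 0.0) for i in range(len(names))}
    # Pairwise distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        tree_i, size_i, _ = clusters.pop(i)
        tree_j, size_j, _ = clusters.pop(j)
        # Distance from the merged cluster to each remaining one is the
        # size-weighted average of the two old distances.
        merged = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            merged[(k, next_id)] = (size_i * dik + size_j * djk) / (size_i + size_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(merged)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j, dij / 2)
        next_id += 1
    tree, _, height = next(iter(clusters.values()))
    return tree, height
```

The last cluster created becomes the root, which is why UPGMA always returns a rooted, ultrametric tree, whereas NJ produces an unrooted tree that needs an out-group to be rooted.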
Character-based phylogeny
Instead of computing distances from the alignment matrix (n × m, for n sequences and m alignment
columns) and using the resulting distance matrix (n × n) to construct the tree, we can use the alignment
matrix directly to build the evolutionary tree with character-based methods. These methods try to find
the character strings for the internal nodes that best explain their descendant species; here the
character string of a species is the amino-acid string, or protein sequence, of that species. An example
is the maximum parsimony method, which defines the best tree as the one that needs the minimum number
of changes [9].
For constructing this phylogenetic tree I used “ProtPars” (Protein Sequence Parsimony Method) from the
PHYLIP toolbox via the Mobyle web service (footnote 9). The results are stored under the
“analysis\*\Parisomy\protpars\” folder. I also plotted its cladogram and tree using the “drawgram” and
“drawtree” programs in PHYLIP. The results are “analysis\*\Parisomy\cladogram.pdf” and
“analysis\*\Parisomy\tree.pdf”.
Figure 2- annotated phylogeny tree obtained by parsimony
9 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=protpars
With command: ln -s BlosumAligned.phylipi infile && protpars <protpars.params && mv outfile protpars.outfile && mv outtree protpars.outtree
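The "minimum number of changes" criterion can be illustrated with Fitch's algorithm, which counts the changes a fixed tree needs for one alignment column. The sketch below is only an assumption-laden illustration: it treats every amino-acid change as cost 1, whereas ProtPars uses its own, more refined counting, and the names `fitch_site` and `parsimony_score` as well as the nested-tuple tree format are hypothetical.

```python
def fitch_site(tree, states):
    """Return (candidate state set, number of changes) for one alignment column.

    tree: a leaf label (str) or a pair (left, right) of subtrees.
    states: dict mapping each leaf label to its residue in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed here.
        return common, left_changes + right_changes
    # The children disagree: one change is needed on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all columns (alignment: label -> aligned sequence)."""
    length = len(next(iter(alignment.values())))
    total = 0
    for col in range(length):
        column = {label: seq[col] for label, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total
```

Maximum parsimony then amounts to searching over tree shapes for the one with the smallest `parsimony_score`; the search itself is the expensive part and is not shown here.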
Step 4, Evaluation
For evaluating the constructed phylogenies, I took two approaches: checking their consistency with
biological information, and bootstrapping.
Consistency
For evaluating the resulting trees I compare how consistent they are with the taxonomy or biological
data I have, both what is given in the protein sequences' descriptions and others' results. For doing
this, I first renamed the sequences so that their subtype plus their year and location becomes their identifier,
using “\RenameSequences\Renaming\Rename.java”. In this way the resulting tree becomes meaningful
and readable (Figures 1 and 2). I used these renamed sequences to build the phylogenetic trees by both
the distance-based and parsimony methods, and the resulting cladograms are
“analysis\Consistence\Parisomy\cladogram.pdf” and “analysis\Consistence\Distance\cladogram.pdf”. We can see in these
trees that the viruses of the same subtype are grouped in the same clade, which shows the consistency of
these algorithms with biological information. Moreover, most of the branches are consistent with the
corresponding virus year. To illustrate this, I zoomed in on branch H7 of both trees:
Figure 3- Zoomed branch of distance tree (left) and parsimony tree (right)
Based on Figures 1, 2, and 3, with this scoring scheme the parsimony method reflects the evolutionary
distances between closely related sequences better than the distance method.
Further, I compared the resulting cladograms with the trees reported by Yoshiyuki Suzuki, et al. [11], and
there is a high agreement between my results and the results presented in that paper about the divergence of
influenza A virus subtypes (see Figure 4).
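The subtype-consistency check described above was done by eye on the cladograms, but it can also be sketched programmatically. The following Python sketch is hypothetical: it assumes leaf labels of the form "H7_2003_USA" (subtype first, separated by underscores), which may differ from the identifiers produced by Rename.java, and it represents a rooted tree as nested tuples.

```python
def leaves(tree):
    """Collect the leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree:
        result.extend(leaves(child))
    return result


def clades(tree):
    """Yield the set of leaves under every node of the tree (leaves included)."""
    yield frozenset(leaves(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def subtype_of(label):
    """Assumed label format: '<subtype>_<year>_<location>'."""
    return label.split("_")[0]


def monophyletic_subtypes(tree):
    """Map each subtype to True if its sequences form exactly one clade of the rooted tree."""
    all_clades = set(clades(tree))
    groups = {}
    for label in leaves(tree):
        groups.setdefault(subtype_of(label), set()).add(label)
    return {subtype: frozenset(members) in all_clades
            for subtype, members in groups.items()}
```

A subtype reported as False has at least one sequence placed outside the clade formed by the others, which is the kind of inconsistency the visual inspection looks for.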
Figure 4- Comparing resulted cladogram from distance method (right) with the reported cladogram for Influenza A virus by
Yoshiyuki Suzuki, et al. (left). One can see that the clades are mostly identical.
Bootstrapping
Apart from consistency, I evaluate the resulting trees by bootstrapping. For doing this, I simply set
the parameters in “ProtPars” and “ProtDist”+“NJ” to perform bootstrapping, and they produced 100
bootstrapped trees, which I then fed into the “Consensus” tree program in the PHYLIP toolbox (via the
Mobyle web service, footnote 10). This program generated a consensus tree with bootstrap values on its branches
that show the agreement on each branch among all bootstrapped trees (see “\Boostrap\Parisomy\
TreeWithBootstrapValues.txt” and “\Boostrap\Distance\TreeWithBootstrapValues.txt”) [10]. One
technical point is that, for the distance method, the “Protdist” web service is extremely slow; therefore, I
tried the PHYLIP package directly. I used “seqboot” in the PHYLIP package to generate 100
bootstrapped MSAs. Using these bootstrapped samples I produced 100 trees with “neighbor” and finally
obtained the consensus tree with bootstrap values with “Consensus”. This is still slow but much faster than
the web service. Unsurprisingly, the results are identical.
10 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=consense
With command: ln -s protpars.outtree intree && consense <consense.params && mv outfile consense.outfile && mv outtree consense.outtree
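The two halves of the bootstrap procedure, resampling the alignment and counting how often a clade reappears, can be sketched in Python as follows. This is a simplified illustration and not the implementation of "seqboot" or "consense": the function names, the dictionary representation of the alignment, and the representation of each replicate tree as a set of clades are all assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length.

    alignment: dict mapping each label to its aligned sequence (all equal length).
    rng: a random.Random instance, so that runs can be made reproducible.
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {label: "".join(seq[c] for c in columns)
            for label, seq in alignment.items()}


def clade_support(reference_clades, replicate_clade_sets):
    """Fraction of replicate trees that contain each reference clade (between 0 and 1)."""
    n = len(replicate_clade_sets)
    return {clade: sum(clade in replicate for replicate in replicate_clade_sets) / n
            for clade in reference_clades}
```

Each resampled alignment would be passed through the same tree-building method as the original data; a support value of 1 then means that the clade was recovered in every replicate.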
Comparing the bootstrap values in the consensus trees of the distance and parsimony methods revealed
that the parsimony method is by far better than the distance method for this specific task and with this
dataset and these settings: most of the branches in its tree have 100% or otherwise high bootstrap
values, while the distance method produces relatively poor bootstrap values on its branches. For
example, compare these two branches with similar species and different bootstrap values. The upper one is
from the parsimony method (note that bootstrap values are between 0 and 1, i.e. 1 means 100%) and the
lower one is from the distance method.
Figure 5- Bootstrap values: the upper one corresponds to the parsimony method and the bottom one to the distance
method
References
[1] http://en.wikipedia.org/wiki/Influenza
[2] http://en.wikipedia.org/wiki/Influenzavirus_A
[3] http://en.wikipedia.org/wiki/Influenza_Genome_Sequencing_Project
[4] http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html
[5] http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi?go=1
[6] http://www.uniprot.org/
[7] http://en.wikipedia.org/wiki/Substitution_matrix
[8] http://en.wikipedia.org/wiki/Sequence_alignment
[9] N. C. Jones and P. A. Pevzner. "An Introduction to Bioinformatics Algorithms". MIT Press. 2004
[10] http://bioweb2.pasteur.fr/docs/phylip/doc/consense.html
[11] Yoshiyuki Suzuki, et al., Origin and Evolution of Influenza Virus Hemagglutinin Genes, Molecular
Biology and Evolution, Oxford University Press, April 1, 2002