Multiple sequence alignment and their reliability
The Bioinformatics Unit, G.S. Wise Faculty of Life Science, Tel Aviv University, Israel. January 2013. By Haim Ashkenazy. http://guidance.tau.ac.il/workshop_2013/
Multiple sequence alignment and their reliability
The Bioinformatics Unit, G.S. Wise Faculty of Life Science
Tel Aviv University, Israel
January 2013
By Haim Ashkenazy
http://guidance.tau.ac.il/workshop_2013/
TAU Bioinformatics Workshop
What are alignments good for?
• To compare sequences
  o Find homology
  o Similar sequence, similar function
• To learn about sequence evolution
  o Mismatch = point mutation
  o Gap = indel (insertion or deletion)
  o Reconstruct phylogenetic tree
  o Infer selection forces, e.g., detecting positive selection, co-evolving sites
• For structure prediction
  o Similar regions potentially have similar structure
Making an alignment (pairwise)
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
• For 2 sequences: pairwise alignment
  o Local alignment: finds regions of high similarity in parts of the sequences
  o Global alignment: finds the best alignment across the entire two sequences
• Use exact solution
  o Needleman-Wunsch (for global) or Smith-Waterman (for local) - http://www.ebi.ac.uk/Tools/psa/
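The exact global method named above can be sketched in a few lines. This is a minimal Needleman-Wunsch illustration, not the EBI implementation: the scoring values (match = 1, mismatch = -1, linear gap = -1) are assumptions chosen for readability, whereas real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b (illustrative scoring values)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # (mis)match
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"))
```

When several alignments share the best score, the traceback returns only one of them; which one depends on the order of the checks.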
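Sequence evolution
[Figure: sequences evolving along a tree with time points 30 MYA, 5 MYA and Today; tips labeled Human, Chimp, Mouse]
30 MYA: ATGAAATAA
5 MYA: ATGTTTTAA, ATGCCCAAATAA
Today: ATGTTTTCA, ATGTTTTAA, ATGCCCAAA
Two alignments of the present-day sequences that differ in gap placement:
A T G - - - T T T T A A
A T G - - - T T T T C A
A T G C C C A A A - - -

A T G - - - T T T T A A
A T G - - - T T T T C A
A T G C C C - - - A A A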
Alignment and phylogeny are mutually dependent
[Figure: unaligned sequences, sequence alignment (MSA) and phylogeny reconstruction shown as dependent steps, with inaccurate tree building feeding into the MSA; tree scale bar 0.4]
Alignment and phylogeny are both challenging
• ~25% of residues are wrongly aligned (based on BAliBASE: a large representative set of proteins)
• 5% of tree branches are wrong (based on simulations of 100 protein sequences)
Making an alignment (MSA)
• For more sequences: multiple sequence alignment (MSA)
  o Exact methods are not feasible (too slow)
  o We use heuristic methods
  o Several advanced MSA programs are available
Basically two recommended methods:
• MAFFT: fastest and one of the most accurate
• PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions
Progressive alignment
First step: compute pairwise distances
Sequences: A, B, C, D, E
Compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table.
     A    B    C    D    E
A
B    8
C    15   17
D    16   14   10
E    32   31   31   32
Second step: build a guide tree
[Figure: guide tree with leaves A, D, C, B, E, next to the pairwise distance table]
Cluster the sequences to create a tree (guide tree):
• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree
The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
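To make the clustering step concrete, here is a small sketch that turns the distance table for A to E into a join order. The slides do not say which clustering method builds the guide tree, so average linkage (UPGMA) is an assumption here, and the function names are illustrative only. The resulting order need not match the tree drawn on the slide.

```python
from itertools import combinations

# Pairwise distances copied from the table for sequences A-E.
TABLE = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}
DIST = {frozenset(pair): d for pair, d in TABLE.items()}


def average_distance(c1, c2, dist):
    """Mean leaf-to-leaf distance between two clusters (average linkage)."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def guide_tree_join_order(names, dist):
    """Return the sequence of cluster merges, closest clusters first."""
    clusters = [(name,) for name in names]
    joins = []
    while len(clusters) > 1:
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        joins.append((left, right))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return joins


if __name__ == "__main__":
    for left, right in guide_tree_join_order("ABCDE", DIST):
        print("join", "".join(left), "+", "".join(right))
```

The join order is also the order in which a progressive aligner would merge sequences and sub-alignments.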
Third step: align sequences in a bottom-up order
[Figure: guide tree with leaves A, D, C, B, E, joining Sequence A through Sequence E]
1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs deeper in the tree
Multiple sequence alignment (MSA): progressive alignment
[Figure: workflow from sequences A to E through the pairwise distance table and the guide tree to the MSA, with a step labeled "Iterative"]
Sources of alignment errors
• Progressive alignment algorithms are greedy heuristics
• Co-optimal solutions: Heads-or-Tails (HoT) scores (Landan & Graur 2007)
GEELTNWPSPVCHNRLASGIDDSTAFRFPRPQKWIISYSLHCVI...
GEELTLWPSPVCHNRLASGIDASIAFRFPRAQKRFYRYSLHCVI...
TEELTHWPFPVCRNRLARGIGSAIAFRCPRSQEHI-RNSLPCVI...
TEELRYWPFPVCQN--ARGNGSVIEARNPGSQ-----KVLPYVI...

...IVCHLSYSIIWKQPRPFRFATSDDIGSALRNHCVPSPWNTLEEG
...IVCHLSYRYFRKQARPFRFAISADIGSALRNHCVPSPWLTLEEG
...IVCPLSNRI-HEQSRPCRFAIASGIGRALRNRCVPFPWHTLEET
...IVYPLVK-----QSGPNRAEIVSGNGRA--NQCVPFPWYRLEET
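The Heads-or-Tails idea can be sketched as follows: align the sequences as given ("heads"), align their reversed copies, and reverse that result back ("tails"). Where the two alignments disagree, the aligner made an arbitrary choice among co-optimal solutions. The `align` argument is any function that maps a list of sequences to a list of gapped rows; `pad_right` below is a deliberately naive, hypothetical stand-in used only so the sketch runs, not a real aligner.

```python
def heads_and_tails(sequences, align):
    """Return the 'heads' alignment and the 'tails' alignment.

    `align` is any callable taking a list of sequences and returning
    a list of gapped rows of equal length (an assumption of this sketch).
    """
    heads = align(sequences)
    reversed_rows = align([seq[::-1] for seq in sequences])
    tails = [row[::-1] for row in reversed_rows]
    return heads, tails


def pad_right(sequences):
    """Toy stand-in aligner: pad every sequence with gaps on the right."""
    width = max(len(seq) for seq in sequences)
    return [seq + "-" * (width - len(seq)) for seq in sequences]


if __name__ == "__main__":
    heads, tails = heads_and_tails(["ACGT", "AGT"], pad_right)
    print(heads)  # gaps placed at the right end
    print(tails)  # gaps placed at the left end
```

With a real aligner in place of `pad_right`, the two outputs can be compared column by column to flag unreliable regions.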
GUIDANCE: Guide-tree based alignment confidence scores
[Figure: base alignment → bootstrap sampling of NJ trees (Tree 1, Tree 2, …, Tree 99, Tree 100) → progressive alignment with each tree (MSA 1, MSA 2, …, MSA 99, MSA 100) → GUIDANCE scores on a scale from 0 to 1]
Penn, Privman et al. MBE. 2010
Comparing alignments
Common measures to quantify distance between two MSAs:
1. CS: Each column of the MSA that is identically aligned in the other MSA is given a score of 1; all other columns are given the score 0.
2. SP: Each pair of residues in the MSA that is identically aligned in the other MSA is given a score of 1; all other residue pairs are given the score 0.
3. Sum-of-pairs column score (SPC): The score of each column is simply the average of the SPs over all pairs in it.
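The CS and SP measures can be sketched as below. Two assumptions are made: both MSAs are lists of gapped strings holding the same sequences in the same order, and the per-column or per-pair 0/1 scores are averaged into a single number per alignment. The function names are illustrative.

```python
from itertools import combinations


def residue_columns(msa):
    """Describe each column by the residue index of every row (None = gap)."""
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        entry = []
        for row, seq in enumerate(msa):
            if seq[col] == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns


def column_score(msa, other):
    """CS: fraction of columns of `msa` reproduced identically in `other`."""
    other_columns = set(residue_columns(other))
    columns = residue_columns(msa)
    return sum(col in other_columns for col in columns) / len(columns)


def aligned_pairs(msa):
    """All residue pairs that share a column, as ((row, pos), (row, pos))."""
    pairs = set()
    for column in residue_columns(msa):
        present = [(row, pos) for row, pos in enumerate(column) if pos is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(msa, other):
    """SP: fraction of aligned residue pairs of `msa` also aligned in `other`.

    Assumes `msa` contains at least one aligned residue pair.
    """
    pairs = aligned_pairs(msa)
    return len(pairs & aligned_pairs(other)) / len(pairs)


if __name__ == "__main__":
    a = ["ACGT", "A-GT"]
    b = ["ACGT", "AG-T"]
    print(column_score(a, b), sum_of_pairs_score(a, b))
```

SP is more forgiving than CS: one misplaced residue spoils its whole column for CS but only the pairs it takes part in for SP.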
Accuracy of GUIDANCE scores
http://guidance.tau.ac.il
As a rule of thumb, use HoT for fewer than 8 sequences
http://guidance.tau.ac.il
[Screenshot: GUIDANCE web server input form]
• Un-aligned sequences (FASTA format)
• Choose sequence type
• Choose alignment method
GUIDANCE results
[Screenshot: MSA colored by confidence score, on a scale from Confident to Uncertain, with a sequence score for each row and a column score for each column]
GUIDANCE outputs
• Download MSA for down-stream analysis
• Text files with all scores
• Mask residues by score
• Remove unreliable sequences
Filtering sequences with low scores and re-aligning
• Remove unreliable sequences
• Re-align sequences after filtration
• Sequences left after filtration
But always remember not to remove too much data and consider the biology…
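Sequence filtering can be sketched as a simple threshold on the per-sequence scores. The data layout (dictionaries keyed by sequence name) and the cutoff in the example are assumptions for illustration, not the GUIDANCE output format or its default threshold. Gaps are stripped so the surviving sequences are ready to be re-aligned.

```python
def filter_sequences(msa, sequence_scores, cutoff):
    """Keep sequences scoring at least `cutoff`, returned without gaps.

    msa:             dict mapping sequence name -> gapped row (assumed layout)
    sequence_scores: dict mapping sequence name -> score between 0 and 1
    """
    return {
        name: row.replace("-", "")
        for name, row in msa.items()
        if sequence_scores[name] >= cutoff
    }


if __name__ == "__main__":
    msa = {"s1": "AC-GT", "s2": "ACGGT", "s3": "A---T"}
    scores = {"s1": 0.95, "s2": 0.90, "s3": 0.30}
    # The cutoff value is arbitrary and chosen only for this example.
    print(filter_sequences(msa, scores, cutoff=0.6))
```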
Filtering columns with low scores
• Remove unreliable columns
• MSA after filtration
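Column filtering follows the same pattern: drop every column whose score falls below a cutoff. In this sketch the MSA is a list of gapped strings and `column_scores` holds one value per column; both the layout and the cutoff are illustrative assumptions.

```python
def filter_columns(msa, column_scores, cutoff):
    """Return the MSA restricted to columns scoring at least `cutoff`."""
    keep = [i for i, score in enumerate(column_scores) if score >= cutoff]
    return ["".join(row[i] for i in keep) for row in msa]


if __name__ == "__main__":
    msa = ["ACGT", "A-GT"]
    column_scores = [1.0, 0.4, 0.9, 1.0]
    # The cutoff value is arbitrary and chosen only for this example.
    print(filter_columns(msa, column_scores, cutoff=0.8))
```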
Filtering residues with low scores
• Masking unreliably aligned residues
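Residue masking keeps the alignment dimensions intact and only hides individual low-scoring residues. The sketch below assumes one score per alignment cell (ignored at gap positions) and uses "X" as the mask character, which suits protein data; for nucleotides "N" would be the usual choice. Layout and cutoff are illustrative assumptions.

```python
def mask_residues(msa, residue_scores, cutoff, mask_char="X"):
    """Replace residues scoring below `cutoff` with `mask_char`; keep gaps."""
    masked = []
    for row, scores in zip(msa, residue_scores):
        masked.append("".join(
            ch if ch == "-" or score >= cutoff else mask_char
            for ch, score in zip(row, scores)
        ))
    return masked


if __name__ == "__main__":
    msa = ["ACGT", "A-GT"]
    residue_scores = [[1.0, 0.2, 1.0, 1.0],
                      [1.0, None, 0.5, 1.0]]  # None marks a gap position
    # The cutoff value is arbitrary and chosen only for this example.
    print(mask_residues(msa, residue_scores, cutoff=0.6))
```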
Filtering unreliable regions can improve down-stream analysis
(Mol Biol Evol 2012;29:1-5)
Acknowledgments
• Prof. Tal Pupko
• Dr. Eyal Privman
• Dr. Osnat Penn
• Pupko’s lab members
1. Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D. and Pupko, T. (2010). GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Research, 2010 Jul 1; 38 (Web Server issue):W23-W28; doi:10.1093/nar/gkq443
2. Penn, O., Privman, E., Landan, G., Graur, D. and Pupko, T. (2010). An alignment confidence score capturing robustness to guide-tree uncertainty. Molecular Biology and Evolution, 2010 Aug; 27(8):1759-67; doi:10.1093/molbev/msq066
3. Landan, G. and Graur, D. (2008). Local reliability measures from sets of co-optimal multiple sequence alignments. Pac Symp Biocomput 13:15-24
Thanks for your attention!
1. Download and save the sequences file "Seq_For_GUIDANCE.fs" from http://guidance.tau.ac.il/workshop_2013/ (File → "Save as"). This file contains 20 protein sequences in FASTA format.
2. Run the GUIDANCE web server to create a protein alignment:
a. Use the GUIDANCE algorithm;
b. Select "amino acids" as the sequence type;
c. Select MAFFT as the alignment method;
d. Run (press the "Submit" button).
e. (In case it does not run for you, you can see the results at: http://guidance.tau.ac.il/results/13589321556364/output.php)
3. What is the alignment score? What does it tell you about the alignment?
4. Which sequences can be removed to improve the alignment? What is the biological justification for that? Try it!
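Before uploading, it can help to check that the downloaded file parses as FASTA and holds the expected number of sequences. The reader below is a minimal sketch: it takes any iterable of text lines, uses the full header line (without ">") as the record name, and does no validation of the residues.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a dict mapping header -> sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


if __name__ == "__main__":
    example = """>seq1 toy protein
    ADLGAVFALC
    DRYFQ
    >seq2
    ADLGRTQNCDRYYQ
    """
    for header, sequence in read_fasta(example.splitlines()).items():
        print(header, len(sequence))
```

For the exercise file, opening "Seq_For_GUIDANCE.fs" and passing the file handle to `read_fasta` should give one entry per sequence, so the length of the returned dictionary can be compared with the 20 sequences stated above.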
Appendix – MSA servers
MAFFT
• Web server & download: http://mafft.cbrc.jp/alignment/server/
Choosing a MAFFT strategy
• Efficiency-tuned variants: from quick & dirty to slow but accurate
L-INS-i
ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------
--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------
------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------
--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo
--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------
G-INS-i
XXXXXXXXXXX-XXXXXXXXXXXXXXX
XX-XXXXXXXXXXXXXXX-XXXXXXXX
XXXXX----XXXXXXXX---XXXXXXX
XXXXX-XXXXXXXXXX----XXXXXXX
XXXXXXXXXXXXXXXX----XXXXXXX
E-INS-i
oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo
---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------
-----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo
---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------
---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------
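For local runs, the three accuracy-oriented strategies correspond to MAFFT command-line options. The sketch below only builds and runs the command; it assumes the `mafft` executable is installed and on the PATH, and the helper names are illustrative. The flag combinations follow the MAFFT manual.

```python
import subprocess

# Command-line options for each strategy, as given in the MAFFT manual.
MAFFT_STRATEGIES = {
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Build the argument list for one strategy (no execution)."""
    return ["mafft", *MAFFT_STRATEGIES[strategy], fasta_path]


def run_mafft(strategy, fasta_path):
    """Run MAFFT and return the alignment in FASTA format as text.

    Assumes `mafft` is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        mafft_command(strategy, fasta_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(" ".join(mafft_command("L-INS-i", "Seq_For_GUIDANCE.fs")))
```

As the diagrams suggest, L-INS-i suits sequences with one alignable domain and unalignable flanks, G-INS-i suits sequences alignable over their full length, and E-INS-i suits several conserved blocks separated by long unalignable regions.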
MAFFT output
• A colored view of the alignment
• Choose a format (Clustal or FASTA) and save as a text file
• Run GUIDANCE also from here!
PRANK
[Figure: classical alignment errors for HIV env]
• Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/
PRANK output
If you need a different format, copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/