COT 6930 HPC and Bioinformatics
Multiple Sequence Alignment
Xingquan Zhu
Dept. of Computer Science and Engineering
Outline
Multiple Sequence Alignment: What, Why, and How
Multiple Sequence Alignment Methods
  Multidimensional dynamic programming
  Star Alignment
  Tree Alignment
  Progressive Alignment
    ClustalW: a widely used algorithm
  Iterative Alignment
    Genetic Algorithm
What is a Multiple Sequence Alignment?
Pairwise alignments involve two sequences.
Multiple sequence alignments involve more than 2 sequences (often 100’s, either nucleotide or protein).
A formal definition: a multiple alignment of strings S1, …, Sk is a series of strings S1’, …, Sk’ with spaces such that
  |S1’| = … = |Sk’|
  Sj’ is an extension of Sj by insertion of spaces
Goal: Find an optimal multiple alignment.
Hs ---MK----- --LSLVAAML LLLSAARAEE EDKK-EDVGT VVGIDLGTTY
Sp ---MKKFQLF SILSYFVALF LLPMAFASGD DNST-ESYGT VIGIDLGTTY
Tg MTAAKKLSLF SLAALFCLLS VATLRPVAAS DAEEGKVKDV VIGIDLGTTY
Pf --------MN QIRPYILLLI VSLLKFISAV DSN---IEGP VIGIDLGTTY
Why do we do multiple alignments?
To reveal the relationship between a group of sequences (homology).
Simultaneous alignment of similar gene sequences may:
  Discover the conserved regions in genes
  Determine the consensus sequence of the aligned sequences
  Help define a protein family that may share a common biochemical function or evolutionary origin, and thus reveal an evolutionary history of the sequences
  Help predict the secondary and tertiary structures of new sequences
MSA Methods
Multidimensional dynamic programming
  Extension of DP to multiple (3) sequences
Star Alignment, Tree Alignment, Progressive Alignment
  Start with an alignment of the most alike sequences and build an alignment by adding more sequences
Iterative methods
  Make an initial alignment of groups of sequences and revise the alignment to achieve a more reasonable result
Multiple Sequence Alignment by DP
Pairwise sequence alignment
  A scoring matrix where each position provides the best alignment up to that point
Extension to 3 sequences
  The lattice of a cube that is to be filled with calculated dynamic programming scores
  Scoring positions on 3 surfaces of the cube represent the alignment of a pair
Scoring of MSA: Sum of Pairs
Score = sum over all possible pairs of amino acids in each column
k(k-1)/2 pairs per column for k sequences
Example using the BLOSUM62 matrix with gap penalty -8:
  - I K
  S I K
  S S E
In column 1, we have the pairs (-,S), (-,S), (S,S): -8 - 8 + 4 = -12
Sum of Pairs
Given 5 sequences:
  N C C E
  N N C E
  N - C N
  S C S N
  S C S E
How many possible combinations of pairwise alignments for each position?
$\binom{5}{2} = \frac{5!}{2!\,3!} = 10$
Sum of Pairs
Assume: match/mismatch/gap = 1/0/-1
  N C C E
  N N C E
  N - C N
  S C S N
  S C S E
The 1st position: # of N-N (3), # of S-S (1), # of N-S (6)
  SP(1) = 4*1 + 0*6 + (-1)*0 = 4
The 2nd position: # of C-C (3), # of N-C (3), # of gaps (4)
  SP(2) = 3*1 + 0*3 + (-1)*4 = -1
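The column-by-column computation can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function names are hypothetical, and treating a gap-gap pair as contributing 0 is an assumption, since the slides do not say how such pairs are scored.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one pair of aligned symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # assumption: a gap-gap pair contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column(column):
    """Sum the scores of all k(k-1)/2 pairs in one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_score(rows):
    """Sum-of-pairs score of a whole alignment (rows of equal length)."""
    return sum(sp_column(col) for col in zip(*rows))

rows = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
print([sp_column(col) for col in zip(*rows)])  # [4, -1, 4, 4]
print(sp_score(rows))                          # 11
```

The first two column scores agree with SP(1) = 4 and SP(2) = -1 computed by hand above.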
Pairwise alignment: dynamic programming matrix
[Figure: dynamic programming matrix for Seq 1 and Seq 2. Each cell is reached by one of three moves: match/mismatch, gap in sequence 1, or gap in sequence 2.]
Multiple sequence alignment: dynamic programming matrix
[Figure: three-dimensional dynamic programming lattice for Seq 1, Seq 2, and Seq 3; each cell can be reached in many possible ways.]
DP Alignment Examples
All three match/mismatch
Sequence 1 & 2 match/mismatch with gap in 3
Sequence 1 & 3 match/mismatch with gap in 2
Sequence 2 & 3 match/mismatch with gap in 1
Sequence 1 with gaps in 2 & 3
Sequence 2 with gaps in 1 & 3
Sequence 3 with gaps in 1 & 2
Choose the largest value among the above seven possibilities
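The seven possibilities correspond to the seven non-zero ways of advancing by 0 or 1 in each of the three sequences. The sketch below fills the cube with sum-of-pairs scores and returns only the optimal score (no traceback). The scoring values 1/0/-1 and the function names are assumptions chosen for illustration.

```python
from itertools import product

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    if a == "-" and b == "-":
        return 0  # assumption: gap-gap pairs are free
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_score(x, y, z):
    """Sum-of-pairs score of one three-symbol column."""
    return pair_score(x, y) + pair_score(x, z) + pair_score(y, z)

def align3_score(a, b, c):
    """Optimal sum-of-pairs score for three sequences via the 3-D lattice."""
    # The seven moves: every (di, dj, dk) in {0,1}^3 except (0, 0, 0).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    D = {(0, 0, 0): 0}
    # Lexicographic order guarantees every predecessor is filled first.
    for i, j, k in product(range(len(a) + 1), range(len(b) + 1), range(len(c) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            x = a[pi] if di else "-"
            y = b[pj] if dj else "-"
            z = c[pk] if dk else "-"
            cand = D[(pi, pj, pk)] + column_score(x, y, z)
            if best is None or cand > best:
                best = cand
        D[(i, j, k)] = best  # largest of the (up to) seven candidates
    return D[(len(a), len(b), len(c))]

print(align3_score("MKE", "MKE", "MKE"))  # 9
print(align3_score("MKE", "MKE", "ME"))   # 5
```

The table holds (|a|+1)(|b|+1)(|c|+1) cells, which is why this approach does not scale beyond a few short sequences.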
Computational Complexity
For protein sequences each 300 amino acids in length, excluding gaps, the DP algorithm needs:
  Two sequences: 300^2 comparisons
  Three sequences: 300^3 comparisons
  N sequences: 300^N comparisons
In general O(L^N), where L is the length of the sequences and N is the number of sequences.
The number of comparisons and the memory required are too large to be practical for N > 3.
Star Alignments
Heuristic method for multiple sequence alignments
Select a sequence sc as the center of the star
For each sequence si among s1, …, sk with index i ≠ c, perform a global alignment of si with sc (using DP)
Aggregate alignments with the principle “once a gap, always a gap.”
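Star Alignments Example
s1: MPE
s2: MKE
s3: MSKE
s4: SKE
[Figure: star joining s1, s2, s3, and s4.]
Pairwise alignments with MKE (s2):
  MPE
  | |
  MKE

  MSKE
    ||
  -MKE

  MKE
   ||
  SKE
Merging, one sequence at a time:
  MPE
  MKE

  -MPE
  -MKE
  MSKE

  -MPE
  -MKE
  MSKE
  -SKE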
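The merging step can be sketched as follows. This is a minimal illustration of "once a gap, always a gap", assuming the pairwise alignments against the center are already available; here they are copied from the example (MKE/MPE, -MKE/MSKE, MKE/SKE). The function names `merge` and `star_merge` are hypothetical.

```python
def merge(msa, center_g, other_g):
    """Add one pairwise alignment (gapped center, gapped other) to msa.

    msa[0] is the gapped center built so far. Gap columns are never
    removed: once a gap, always a gap.
    """
    old = msa[0]
    rows = [[] for _ in msa]
    added = []
    p = q = 0
    while p < len(old) or q < len(center_g):
        old_gap = p < len(old) and old[p] == "-"
        new_gap = q < len(center_g) and center_g[q] == "-"
        if old_gap and not new_gap:
            # Gap column already in the MSA: keep it, pad the new row.
            for out, row in zip(rows, msa):
                out.append(row[p])
            added.append("-")
            p += 1
        elif new_gap and not old_gap:
            # New gap in the center: open a gap column in every old row.
            for out in rows:
                out.append("-")
            added.append(other_g[q])
            q += 1
        else:
            # Same center residue (or a shared gap): copy the column.
            for out, row in zip(rows, msa):
                out.append(row[p])
            added.append(other_g[q])
            p += 1
            q += 1
    return ["".join(out) for out in rows] + ["".join(added)]

def star_merge(center, pairwise):
    """Fold all pairwise alignments into one MSA, center row first."""
    msa = [center]
    for center_g, other_g in pairwise:
        msa = merge(msa, center_g, other_g)
    return msa

pairwise = [("MKE", "MPE"), ("-MKE", "MSKE"), ("MKE", "SKE")]
print(star_merge("MKE", pairwise))  # ['-MKE', '-MPE', 'MSKE', '-SKE']
```

The result contains the same four rows as the final merged alignment in the example, with the center listed first.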
Choosing a center
Try them all and pick the one with the best score
Calculate all O(k^2) alignments, and pick the sequence sc that maximizes
$\sum_{i \neq c} \mathrm{score}(s_c, s_i)$
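Star Alignment Example
S1 = ATTGCCATT
S2 = ATGGCCATT
S3 = ATCCAATTTT
S4 = ATCTTCTT
S5 = ATTGCCGATT
Pairwise alignment scores, with the row sum in the last column:
       s1   s2   s3   s4   s5   sum
  s1    .    7   -2    0   -3     2
  s2    7    .   -2    0   -4     1
  s3   -2   -2    .    0   -7   -11
  s4    0    0    0    .   -3    -3
  s5   -3   -4   -7   -3    .   -17
s1 has the largest sum, so it is chosen as the center.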
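Choosing the center from a table of pairwise scores is a row-sum followed by a maximum. The sketch below uses the scores from this example; the dictionary layout and the name `choose_center` are assumptions made for illustration.

```python
def choose_center(names, scores):
    """Return the sequence with the largest summed score to all others."""
    totals = {name: 0 for name in names}
    for (a, b), s in scores.items():
        totals[a] += s  # each pairwise score counts for both sequences
        totals[b] += s
    center = max(names, key=lambda name: totals[name])
    return center, totals

names = ["s1", "s2", "s3", "s4", "s5"]
scores = {
    ("s1", "s2"): 7, ("s1", "s3"): -2, ("s1", "s4"): 0, ("s1", "s5"): -3,
    ("s2", "s3"): -2, ("s2", "s4"): 0, ("s2", "s5"): -4,
    ("s3", "s4"): 0, ("s3", "s5"): -7,
    ("s4", "s5"): -3,
}
center, totals = choose_center(names, scores)
print(center)  # s1
print(totals)  # {'s1': 2, 's2': 1, 's3': -11, 's4': -3, 's5': -17}
```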
Star Alignment Example: Merging Pairwise Alignments
Analysis
Assuming all sequences have length n:
  O(n^2) to calculate one global alignment
  O(k^2) global alignments to calculate
  Using a reasonable data structure for joining alignments, merging costs no worse than O(kl), where l is an upper bound on alignment lengths
  O(k^2 n^2 + kl) = O(k^2 n^2) overall cost
Tree Alignment
Compute the overall similarity based on pairwise alignment along each edge
  An edge joining sequences s1 and s2 has weight sim(s1, s2)
The sum of all these weights is the score of the tree
Consensus String
The consensus string derived from a multiple alignment is the concatenation of the consensus characters for each column.
The consensus character for a column is the character that minimizes the summed distance to it from all the characters in that column.
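As a small illustration, the definition can be applied column by column. The slides do not fix a distance function, so this sketch assumes a 0/1 distance (0 for identical symbols, 1 otherwise), under which the consensus character is simply the most frequent symbol in the column. The candidate alphabet is taken from the alignment itself, and ties go to the alphabetically first symbol; both choices are assumptions.

```python
def consensus(rows, dist=lambda a, b: 0 if a == b else 1):
    """Consensus string of an alignment given as equal-length rows."""
    alphabet = sorted(set("".join(rows)))  # candidate consensus characters
    out = []
    for col in zip(*rows):
        # Pick the character with the smallest summed distance to the column.
        best = min(alphabet, key=lambda ch: sum(dist(ch, x) for x in col))
        out.append(best)
    return "".join(out)

rows = ["NCCE", "NNCE", "N-CN", "SCSN", "SCSE"]
print(consensus(rows))  # NCCE
```

A different distance, for example one derived from a substitution matrix, can be passed as `dist` without changing the rest of the code.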
Tree Alignment Example
Scoring system used is
$p(a,b) = 1$ if $a = b$
$p(a,b) = 0$ if $a \neq b$
$p(a,-) = -1$
[Figure: tree over the sequences CAT, GT, CTG, and CG, with the gapped forms -GT and C-G; edge weights 3, 0, 1, 3, 1.]
We have a score of 8
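The tree score can be checked by summing the edge weights under the scoring system above. The pairing of sequences along each edge below is an assumption reconstructed from the figure labels (internal nodes CAT and CTG, leaves CAT, GT, CTG, CG), not something the slide states explicitly; it does reproduce the edge weights 3, 0, 1, 3, 1.

```python
def p(a, b):
    """Column score: 1 for a match, 0 for a mismatch, -1 against a gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0

def sim(x, y):
    """Similarity of two already-aligned strings of equal length."""
    return sum(p(a, b) for a, b in zip(x, y))

# Assumed edges of the tree, each given as an aligned pair.
edges = [
    ("CAT", "CAT"),
    ("CAT", "-GT"),
    ("CAT", "CTG"),
    ("CTG", "CTG"),
    ("CTG", "C-G"),
]
weights = [sim(x, y) for x, y in edges]
print(weights)       # [3, 0, 1, 3, 1]
print(sum(weights))  # 8
```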
Analysis
We don’t know the correct tree
Without the tree, the tree alignment problem is NP-complete
Likely only exponential-time solutions are available (for optimal answers)
Progressive Methods
DP-based MSA programs are limited to 3 sequences or to a small # of relatively short sequences
Progressive alignment uses DP to build an MSA starting with the most related sequences and then progressively adding less-related sequences or groups of sequences to the initial alignment
Most commonly used approach
Progressive Methods
Progressive alignment is heuristic.
  It does not separate the process of scoring an alignment from the optimization algorithm
  It does not directly optimize any global scoring function of “alignment correctness”.
  It is fast and efficient, and the results are reasonable.
We will illustrate this using ClustalW.
Progressive MSA occurs in 3 stages
1. Do a set of global pairwise alignments (Needleman and Wunsch)
2. Create a guide tree
3. Progressively align the sequences
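Stage 2 can be sketched as a clustering of the pairwise distances obtained in stage 1. ClustalW itself builds its guide tree with neighbor-joining; the simpler average-linkage (UPGMA-style) clustering is swapped in here to keep the sketch short. The distance values and the name `guide_tree` are hypothetical.

```python
from itertools import combinations

def guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    clusters = [(name, [name]) for name in names]  # (subtree, leaves)

    def avg(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters that are closest on average.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical distances, e.g. 1 - fractional identity of each pair.
dist = {
    frozenset({"A", "B"}): 0.1, frozenset({"A", "C"}): 0.4,
    frozenset({"B", "C"}): 0.5, frozenset({"A", "D"}): 0.8,
    frozenset({"B", "D"}): 0.9, frozenset({"C", "D"}): 0.7,
}
print(guide_tree(["A", "B", "C", "D"], dist))  # ('D', ('C', ('A', 'B')))
```

Read from the innermost pair outward, the tree gives the order for stage 3: align A with B first, then add C, then D.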
ClustalW Procedure
Progressive Methods: ClustalW
http://www.ebi.ac.uk/clustalw/
ClustalW is a general purpose multiple alignment program for DNA or proteins.
ClustalW: the W stands for “weighting”, representing the ability of the program to provide weights to the sequence and program parameters.
CLUSTALX provides a graphic interface
Operational options
Output options
Input options, matrix choice, gap opening penalty
Gap information, output tree type
File input in GCG, FASTA, EMBL, GenBank, Phylip, or several other formats
Use Clustal W to do a progressive MSA
Progressive MSA stage 3 of 3: progressive alignment
Make an MSA based on the order in the guide tree
Start with the two most closely related sequences
Then add the next closest sequence
Continue until all sequences are added to the MSA
Problems w/ Progressive Alignment
Highly sensitive to the choice of initial pair to align.
  The very first sequences to be aligned are the most closely related on the sequence tree.
  If these sequences align well, there are few errors in the initial alignment
  The more distantly related these sequences, the more errors
  Errors in the initial alignment are propagated to the MSA
Iterative Methods
Results do NOT depend on the initial pairwise alignment (recall progressive methods)
Starting with an initial alignment and repeatedly realigning groups of the sequences
Repeat until the MSA no longer changes significantly from one iteration to the next.
The alignment is expected to improve over the iterations.
An example is the genetic algorithm approach.
Genetic Algorithms
A general problem solving method modeled on evolutionary change.
Inspired by the biological evolution process
Uses concepts of “Natural Selection” and “Genetic Inheritance” (Darwin 1859)
Create a set of candidate solutions to your problem, and cause these solutions to evolve and become more and more fit over repeated generations.
Use survival of the fittest, mutation, and crossover to guide evolution.
Genetic Search Algorithms
Random generation (candidate solutions)
Evaluation (fitness function)
Selection (candidate solutions with larger fitness values have a larger chance of being included)
Crossover + Mutation (change some selected candidate solutions to converge to the optimal solution and to avoid getting stuck in a local extreme)
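The loop above can be sketched for a toy alignment problem. Everything here is an assumption made for illustration: a candidate solution is a list of gapped rows of a fixed width, fitness is the sum-of-pairs score with match/mismatch/gap = 1/0/-1, selection keeps the fitter half (a simplification of fitness-proportional selection), crossover mixes rows from two parents, and mutation re-draws the gaps of one row.

```python
import random
from itertools import combinations

def sp_score(rows, match=1, mismatch=0, gap=-1):
    """Fitness: sum-of-pairs score; gap-gap pairs are skipped."""
    total = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def random_row(seq, width, rng):
    """Scatter gaps at random so that the row has the requested width."""
    gaps = set(rng.sample(range(width), width - len(seq)))
    it = iter(seq)
    return "".join("-" if i in gaps else next(it) for i in range(width))

def mutate(ind, seqs, width, rng):
    """Re-draw the gap positions of one randomly chosen row."""
    k = rng.randrange(len(ind))
    return ind[:k] + [random_row(seqs[k], width, rng)] + ind[k + 1:]

def crossover(p1, p2, rng):
    """Row-wise uniform crossover; every row keeps the common width."""
    return [rng.choice(pair) for pair in zip(p1, p2)]

def evolve(seqs, width, pop_size=30, generations=50, seed=0):
    rng = random.Random(seed)
    # Random generation of candidate solutions.
    pop = [[random_row(s, width, rng) for s in seqs] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sp_score, reverse=True)   # evaluation
        parents = pop[: pop_size // 2]          # selection: fitter half
        children = [pop[0]]                     # keep the best unchanged
        while len(children) < pop_size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if rng.random() < 0.3:
                child = mutate(child, seqs, width, rng)
            children.append(child)
        pop = children
    return max(pop, key=sp_score)

seqs = ["MPE", "MKE", "MSKE", "SKE"]
best = evolve(seqs, width=4)
print(best, sp_score(best))
```

Because the best candidate is always carried over unchanged, the best fitness in the population never decreases from one generation to the next.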