Multiple Sequence Alignments
It is God’s privilege to conceal things, but the kings’ pride is to research them.
(Proverbs 25:2; ascribed to King Solomon of Israel, BC 1000)
January 1-4, 2006, Protein Folding Winter School
Keehyoung Joo
School of Computational Sciences, KIAS, Seoul, Korea
The major goal of computational sequence analysis is to predict the structure and function of genes and proteins from their sequence.
Contents
- How to make your model from sequence?
- What is a Multiple Sequence Alignment (MSA)?
- How can I use an MSA? (Motivation)
- What makes MSA difficult?
  - The choice of the sequences
  - The choice of an objective function
  - The optimization of that function
- How to make an MSA?
How to make your model from sequence?
Tertiary structure prediction methods:
- Homology modeling
- Fold recognition
- Ab initio method
[Figure: an unknown sequence is searched against a fold database (Protein Data Bank) to find template folds and an alignment; the model is then built from the templates and the alignment.]
What is a Multiple Sequence Alignment?
MSA can be seen as a generalization of Pairwise Sequence Alignment.
How can I use an MSA? (Motivation)
- Clustering, classification, or categorization of genes/proteins.
- Identification of conserved regions.
- Detecting point mutations.
- Deducing evolutionary relationships and phylogenetic trees.
- Assisting in the prediction of secondary and tertiary structure.
What makes MSA difficult?
It stands at the crossroads of three distinct technical difficulties:
- Choice of the sequences: database search, starting from an unknown sequence.
- Choice of an objective function: what is a good alignment? (Biology)
- Optimization of that function: what is the good alignment? (Computation)
The choice of the sequences
- Sequences sharing a common ancestor (homologous sequences)
- PSI-BLAST, FASTA, various search tools
The choice of an objective function
- A biological problem that lies in the definition of correctness
- Sum of pairs, entropy score, consistency-based, …
The optimization of that function
- Exact algorithms (dynamic programming)
- Progressive alignment (ClustalW)
- Iterative approaches (SA, GA, …)
Example: Sum-of-pairs score
Sequence alignments
Seq A: ARGTCAGATACGLAG---PGMCTETWV
Seq B: ARATCGGAT---IAGTIYPGMCTHTWV
Substitution scores are represented in matrices. The popular ones are PAM and BLOSUM.
$$S(A_L, B_L) = \sum_{i=1}^{L} \mathrm{sub}(a_i, b_i)$$
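As a rough illustration of the pairwise score, the sketch below sums a substitution score over the columns of the Seq A / Seq B alignment. The `sub` function here is a toy match/mismatch scorer and the flat `gap` cost is a simplification; both are assumptions for illustration, not the PAM or BLOSUM values the slide refers to. Given more than two rows, the same function sums over every pair of rows in each column.

```python
from itertools import combinations

def sub(x, y, match=1, mismatch=-1):
    # Toy substitution score (assumption); a real scorer would look up PAM or BLOSUM.
    return match if x == y else mismatch

def sum_of_pairs(rows, gap=-2):
    """Sum-of-pairs score of gapped rows of equal length."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*rows):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue        # a gap facing a gap scores nothing
            if x == "-" or y == "-":
                total += gap    # flat cost per residue facing a gap (assumption)
            else:
                total += sub(x, y)
    return total

seq_a = "ARGTCAGATACGLAG---PGMCTETWV"
seq_b = "ARATCGGAT---IAGTIYPGMCTHTWV"
print(sum_of_pairs([seq_a, seq_b]))
```

Swapping `sub` for a lookup into a real substitution matrix changes the numbers but not the structure of the loop.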
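Example: Sum-of-pairs score (cont.)
Multiple sequence alignments
Seq A1: ARGTCAGATACGLAG---PGMCTETWV----
Seq A2: ARATCGGAT---IAGTIYPGMCTHTWVIAGQ
Seq A3: ARATCE--TACG--GTI-PGMCTHTWVIA--
$$\mathrm{Score}(A_n) = \sum_{k=1}^{L} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{sub}(A_{i,k}, A_{j,k}) + \sum_{i=1}^{n} G(A_i)$$
where $A_{i,k}$ is the symbol of sequence $i$ in column $k$, and $G(A_i)$ is the gap term of sequence $i$, an affine function of gap length with constants $a$ and $b$.
Exact method: multi-dimensional dynamic programming
- Time complexity $O(L^n 2^n)$, space complexity $O(L^n)$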
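To see where the exponential cost comes from, it may help to look at the two-sequence case, where the dynamic programming table has only L x L cells. The sketch below is a standard Needleman-Wunsch score computation with a linear gap cost; the function name and the match, mismatch and gap values are illustrative assumptions. With n sequences the table gains one dimension per sequence, and each cell has up to 2^n - 1 predecessors.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (Needleman-Wunsch, linear gaps)."""
    # prev[j] holds the best score for aligning the first i-1 letters of a
    # with the first j letters of b; only two rows are kept in memory.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("ARGTCAGAT", "ARATCGGAT"))
```

Only the score is returned; recovering the alignment itself would require keeping the full table and tracing back from the last cell.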
How to make an MSA (Methods)
Recent research in the literature:
- MAFFT (2002): based on the fast Fourier transform
- MUSCLE (2004): progressive alignment, pairwise profile alignment, position-specific gap penalty
- PROBCONS (2005): progressive alignment, probability table using HMM, probabilistic consistency-based MSA
Example: Progressive alignment
MSA by adding sequences
Pairwise alignment of all sequence pairs: 1 + 2, 1 + 3, 1 + 4, 2 + 3, 2 + 4, 3 + 4
[Figure: guide tree over sequences 1, 2, 3 and 4, derived from the pairwise alignments]
Progressive alignment (cont.)
Distance matrix: displays the distances of all sequence pairs, D = 1 - S.
Guide tree: built from the distance matrix by UPGMA (unweighted pair group method with arithmetic averages) or the neighbour-joining method.
[Figure: 5 x 5 distance matrix over sequences 1 to 5 and the resulting guide tree]
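A minimal sketch of the distance step, assuming S is the fraction of identical residues over the ungapped columns of each pairwise alignment (the slide does not define S, so this is one common choice rather than the author's definition). The function names and the `pairwise` dictionary layout are hypothetical.

```python
def identity(row_a, row_b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def distance_matrix(n, pairwise):
    """Build a symmetric n x n matrix with D = 1 - S.

    pairwise maps (i, j), i < j, to the two gapped rows of that pair's alignment.
    """
    d = [[0.0] * n for _ in range(n)]
    for (i, j), (row_i, row_j) in pairwise.items():
        d[i][j] = d[j][i] = 1.0 - identity(row_i, row_j)
    return d

# Toy pairwise alignments (made up for illustration).
pairwise = {
    (0, 1): ("ACGT", "ACGA"),
    (0, 2): ("ACGT", "TTTT"),
    (1, 2): ("ACGA", "ACGA"),
}
for row in distance_matrix(3, pairwise):
    print(row)
```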
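UPGMA Clustering (Guide Tree)
Initial distances d_ij:
       1     2     3     4     5
1      0     2     6     9     7
2            0     5     7     7
3                  0     5     4
4                        0     3
5                              0
After merging 1 and 2 into u:
       u     3     4     5
u      0   5.5     8     7
3            0     5     4
4                  0     3
5                        0
After merging 4 and 5 into v:
       u     3     v
u      0   5.5   7.5
3            0   4.5
v                  0
After merging 3 and v into w:
       u     w
u      0  6.83
w            0
[Figure: the guide tree after each merge: (1,2), then (4,5), then (3,(4,5)), then the root joining (1,2) with (3,(4,5))]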
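The clustering steps can be replayed with a short sketch. It assumes the 5 x 5 distance matrix from the slide and uses the size-weighted average that defines UPGMA; the function name and the nested-tuple tree representation are illustrative choices.

```python
def upgma(dist, labels):
    """Return the merges as (subtree_a, subtree_b, distance), closest pair first."""
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merges.append((tree_a, tree_b, dab))
        # Distance from the merged cluster to each remaining one:
        # the average over all leaf pairs, i.e. a size-weighted mean.
        new_d = {}
        for c in clusters:
            dac = d[tuple(sorted((a, c)))]
            dbc = d[tuple(sorted((b, c)))]
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return merges

D = [
    [0, 2, 6, 9, 7],
    [2, 0, 5, 7, 7],
    [6, 5, 0, 5, 4],
    [9, 7, 5, 0, 3],
    [7, 7, 4, 3, 0],
]
for tree_a, tree_b, distance in upgma(D, [1, 2, 3, 4, 5]):
    print(tree_a, tree_b, round(distance, 2))
```

The merge order is (1,2), (4,5), (3,(4,5)) and finally the root, at distances 2, 3, 4.5 and about 6.83.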
Progressive alignment (cont.)
- Columns, once aligned, are never changed; only new gaps are inserted.
- The result depends strongly on the pairwise alignments and the initial starting sequences.
- There is no guarantee that the global optimal solution will be found.
- When sequence identity is below 25-30%, this approach becomes much less reliable.
[Figure: guide tree over sequences 1 to 5, illustrating alignment of alignments]
Progressive Alignment: Discussion
Strengths:
- Speed
- Progression is biologically sensible (aligns using a tree)
Weaknesses:
- No objective function
- No way of quantifying whether or not the alignment is good
- Local minimum problem
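Consistency-based score function
COFFEE score function (Cédric Notredame): given a set of sequences, the optimal MSA is defined as the one that agrees the most with all the possible optimal pairwise alignments.
$$\mathrm{SCORE}_{total} = \frac{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{ij}\,\mathrm{Score}(A_{ij})}{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} W_{ij}\,\mathrm{Len}(A_{ij})}$$
where $A_{ij}$ is the pairwise alignment of sequences $i$ and $j$ induced by the MSA, $W_{ij}$ is the weight of that pair, and $\mathrm{Len}(A_{ij})$ is its length.
Score(Aij) = number of aligned pairs of residues that are shared between Aij and the library.
- Does not depend on a specific substitution matrix.
- Position-dependent alignment.
- The most consistent alignments are often closer to the truth.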
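A minimal sketch of how such a consistency score could be computed. It assumes the library is given as, for each sequence pair, the set of residue index pairs aligned in that pair's optimal pairwise alignment, and that Len(Aij) counts the columns of the induced pairwise alignment that are not gap against gap. The function names, the toy sequences and the uniform weights are all hypothetical.

```python
from itertools import combinations

def aligned_pairs(row_i, row_j):
    """Set of (pos_i, pos_j) residue indices that share a column in two gapped rows."""
    pairs, pos_i, pos_j = set(), 0, 0
    for x, y in zip(row_i, row_j):
        if x != "-" and y != "-":
            pairs.add((pos_i, pos_j))
        if x != "-":
            pos_i += 1
        if y != "-":
            pos_j += 1
    return pairs

def induced_length(row_i, row_j):
    # Columns of the induced pairwise alignment; gap/gap columns are dropped (assumption).
    return sum(1 for x, y in zip(row_i, row_j) if x != "-" or y != "-")

def coffee_score(msa, library, weights):
    """Weighted fraction of the MSA's residue pairs that also appear in the library."""
    num = den = 0.0
    for i, j in combinations(range(len(msa)), 2):
        shared = aligned_pairs(msa[i], msa[j]) & library[(i, j)]
        num += weights[(i, j)] * len(shared)
        den += weights[(i, j)] * induced_length(msa[i], msa[j])
    return num / den if den else 0.0

# Toy data: the library is built from made-up pairwise alignments.
msa = ["AC-G", "ACTG", "A-TG"]
library = {
    (0, 1): aligned_pairs("AC-G", "ACTG"),
    (0, 2): aligned_pairs("ACG", "ATG"),
    (1, 2): aligned_pairs("ACTG", "A-TG"),
}
weights = {pair: 1.0 for pair in library}
print(coffee_score(msa, library, weights))
```

An optimizer such as a genetic algorithm would call `coffee_score` on candidate alignments and keep the ones with the highest value.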
Summary
- MSAs are essential tools in computational biology and bioinformatics.
- They are required for structure/function analysis and structure prediction.
- No perfect method exists for assembling an MSA; all the available methods make approximations.
- The most commonly used methods for MSA use a progressive alignment algorithm (ClustalW).
- Recent progress has focused on the design of iterative (Prrp, SAGA) and consistency-based methods (T-Coffee, ProbCons).
MSA applications
- Profile-profile alignment
- Profile: a table that lists the frequencies of each amino acid at each position of an MSA.
- A profile can be used in database searches:
  - Find new sequences that match the profile
  - Improve search sensitivity
  - Improve search accuracy
Example: Profiles
- Profile: a table that lists the frequencies of each amino acid at each position of a protein sequence.
- Frequencies are calculated from an MSA containing a domain of interest.
- Allows us to identify a consensus sequence.
- The derived scoring scheme allows us to align a new sequence to the profile.
- A profile can be used in database searches: find new sequences that match the profile.
- Profiles are also used to compute multiple alignments heuristically (progressive alignment).
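A minimal sketch of building a profile and reading off a consensus sequence, using the three-sequence alignment from the sum-of-pairs example. Ignoring gaps when computing frequencies is an assumption made here for simplicity; practical tools also add pseudocounts and sequence weights, which are left out.

```python
from collections import Counter

def profile(rows):
    """Per-column amino acid frequencies of an MSA; gaps are ignored (assumption)."""
    table = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values())
        table.append({aa: k / total for aa, k in counts.items()} if total else {})
    return table

def consensus(table):
    """Most frequent residue per column; '-' where a column holds only gaps."""
    return "".join(max(col, key=col.get) if col else "-" for col in table)

msa = [
    "ARGTCAGATACGLAG---PGMCTETWV----",
    "ARATCGGAT---IAGTIYPGMCTHTWVIAGQ",
    "ARATCE--TACG--GTI-PGMCTHTWVIA--",
]
table = profile(msa)
print(table[2])          # frequencies at the third column
print(consensus(table))
```

A new sequence could then be scored against `table` column by column, which is the basis of profile-based database searches.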