multiple sequence alignment. vtisctgsssnigag nhvkwyqqlpg vtisctgtssnigs itvnwyqqlpg lrlscsssgfifss...

23
Multiple Sequence Alignment

Upload: anastasia-shields

Post on 25-Dec-2015

214 views

Category:

Documents


1 download

TRANSCRIPT

Multiple Sequence Alignment

Multiple Sequence Alignment

VTISCTGSSSNIGAGNHVKWYQQLPGVTISCTGTSSNIGSITVNWYQQLPGLRLSCSSSGFIFSSYAMYWVRQAPGLSLTCTVSGTSFDDYYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDGATLVCLISDFYPGAVTVAWKADSATLVCLISDFYPGAVTVAWKADSAALGCLVKDYFPEPVTVSWNSG-VSLTCLVKGFYPSDIAVEWESNG-

• Goal: Bring the greatest number of similar characters into the same column of the alignment. (Similar to alignment of two sequences)

• Definition: In an MSA, homologous residues among a set of sequences are aligned together in columns. Homologous is meant in both the structural and evolutionary sense. Ideally, a column of aligned residues occupy similar three-dimensional structural positions and all diverge from a common ancestral residue.

• Except for trivial cases of highly identical sequences, it is not possible to unambiguously identify structurally or evolutionarily homologous positions and create a single correct multiple alignment.

Multiple Sequence Alignment: Motivation

• Correspondence. Find out which parts “do the same thing”

– Similar genes are conserved across widely divergent species, often performing similar functions

• Structure prediction

– Use knowledge of structure of one or more members of a protein MSA to predict structure of other members

– Structure is more conserved than sequence

• Create “profiles” for protein families

– Allow us to search for other members of the family

• Genome assembly: Automated reconstruction of “contig” maps of genomic fragments such as ESTs

• MSA is the starting point for phylogenetic analysis

Multiple Sequence Alignment: Approaches

• Optimal Global Alignments -Dynamic programming

– Generalization of Needleman-Wunsch

– Find alignment that maximizes a score function

– Computationally expensive: Time grows as product of sequence lengths

• Global Progressive Alignments - Match closely-related sequences first using a

guide tree

• Global Iterative Alignments - Multiple re-building attempts to find best alignment

• Local alignments

– Profiles, Blocks, Patterns

Sum-of-Pairs/SP Score Function

Score of multiple alignment

= ∑i <j score(Si,Sj)

where score(Si,Sj) = score of induced pair-wise alignment

QuickTime™ and a decompressor

are needed to see this picture.

Materials & Methods• Group of related proteins & how to find them: Protein sequences can be related

by homology (common ancestor) or convergence (proteins independently evolve to have common sequence features that typically perform a common function or have a common structure).

• How many sequences are needed?: More is better.

• Redundancy: Redundant sequences will bias the alignment toward their own features. A good rule of thumb in cases of varied similarities is to have only a single

sequence from each set with more than 70-80% intra-sequence similarity.

MSA: Different Methods

• Exact methods• Progressive (ClustalW)• Iterative (MUSCLE)• Consistency (ProbCons)• Structure-based (Expresso)

Progressive Multiple Sequence Alignment

• Compare all sequences pairwise. • Perform cluster analysis on the pairwise data to generate a hierarchy

for alignment. This may be in the form of a binary tree or a simple ordering

• Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair and so on. Once an alignment of two sequences has been made, then this is fixed. Thus for a set of sequences A, B, C, D having aligned A with C and B with D the alignment of A, B, C, D is obtained by comparing the alignments of A and C with that of B and D using averaged scores at each aligned position.

Steps in Multiple Alignment

CLUSTALW MSA

MSA of four oxidoreductase NAD binding domain protein sequences. Red: AVFPMILW. Blue: DE. Magenta: RHK. Green: STYHCNGQ. Grey: all others. Residue ranges are shown after sequence names.

Multiple sequence alignment methods

Iterative methods: compute a sub-optimal solution and keep modifying that intelligently using dynamic programming or other methods until the solution converges.

Examples: MUSCLE, IterAlign, Praline, MAFFT

MUSCLE: next-generation progressive MSA

[1] Build a draft progressive alignment

• Determine pairwise similarity through k-mer counting (not by alignment)

• Compute distance (triangular distance) matrix

• Construct tree using UPGMA

• Construct draft progressive alignment following tree

Page 191

MUSCLE: next-generation progressive MSA

[2] Improve the progressive alignment

• Compute pairwise identity through current MSA

• Construct new tree with Kimura distance measures

• Compare new and old trees: if improved, repeat this step, if not improved, then we’re done

Page 191

MUSCLE: next-generation progressive MSA

[3] Refinement of the MSA

• Split tree in half by deleting one edge

• Make profiles of each half of the tree

• Re-align the profiles

• Accept/reject the new alignment

MUSCLE: next-generation progressive MSA

[3] Refinement of the MSA

• Split tree in half by deleting one edge

• Make profiles of each half of the tree

• Re-align the profiles

• Accept/reject the new alignment

Iterative approaches: MAFFT

•Uses Fast Fourier Transform to speed up profile alignment•Uses fast two-stage method for building alignments using k-mer frequencies•Offers many different scoring and aligning techniques•One of the more accurate programs available•Available as standalone or web interface•Many output formats, including interactive phylogenetic trees

Page 190

MSA: Consistency Approach

• Consistency-based algorithms: generally use a database of both local high-scoring alignments and long-range global alignments to create a final alignment

• These are very powerful, very fast, and very accurate methods

• Examples: T-COFFEE, Prrp, DiAlign, ProbCons

Page 192

ProbCons—consistency-based approach

•Combines iterative and progressive approaches with a unique probabilistic model.

•Uses Hidden Markov Models to calculate probability matrices for matching residues, uses this to construct a guide tree

•Progressive alignment hierarchically along guide tree

•Post-processing and iterative refinement (a little like MUSCLE)

Structure Based MSA

• Structure based alignment is more informative and give us more efficient information about Proteins.

o 3D structure is more conserved then simple amino acid sequence. o Tertiary protein structure evolves more slowly than primary structure. So it is

possible to improve the accuracy of alignment by including information about the three dimensional structure of protein.

The structure of Protein is very complex. So Protein structural alignment is more time consuming and complicated.

Structure Based MSA: 3D-Coffee and Expresso

• 3D-Coffee is a special mode of T-Coffee that uses structural information. In 3D-Coffee the original sequences are replaced by homologous structures (templates). The templates are aligned using strutural aligners like sap or TMalign and these structure based alignments are used as a template for realigning the original sequences. Provided the right templates have been found, the result is a structure based multiple sequence alignment. If some of the sequences do not have any template, they are treated like regular sequences and aligned with the appropriate methods.

• Expresso is an extension of 3D-Coffee where the structural templates are automatically identified with BLAST.

QuickTime™ and a decompressor

are needed to see this picture.

Structure Based MSA: 3D-Coffee and Expresso

Protein Data Bank accession numbers

Two Extremes of MSA

• Recently diverged sequences from each other did not change much. Hence these will be very similar across the length. So, we cannot tell whether distant regions that are conserved due to their importance from regions that did not have time to diverge.

• Sequences that are very diverged from each other (or perhaps are

not related to begin with) and have no similar sequence regions. It

still is possible that the proteins are related and have corresponding

regions, but we cannot identify these by their sequence. This can

change if we have more data (more sequences and/or experimental

data) or use better alignment methods.