multiple sequence alignment. terminology n motif: the biological object one attempts to model - a...

59
Multiple Sequence Alignment

Upload: donald-stokes

Post on 23-Dec-2015

226 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Multiple Sequence Alignment

Page 2: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 3: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Terminology

Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation site etc.

Pattern: a qualitative motif description based on a regular expression-like syntax

Profile: a quantitative motif description - assigns a degree of similarity to a potential match

Page 4: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Global Alignment

Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.

Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

Page 5: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

What is Multiple Sequence Alignment (MSA)?

• Multiple sequence alignment (MSA) can be seen as a generalization of Pairwise Sequence Alignment - instead of aligning two sequences, n sequences are aligned simultaneously, where n is > 2

• Definition: A multiple sequence alignment is an alignment of n > 2 sequences obtained by inserting gaps (“-”) into sequences such that the resulting sequences have all length L and can be arranged in a matrix of N rows and L columns where each column represents a homologous position (each column corresponds to a specific residue in the 'prototypical' protein)

Page 6: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Multiple Sequence Alignment

MSA applies both to nucleotide and amino acid sequences

To construct a multiple alignment, one may have to introduce gaps in sequences at positions where there were no gaps in the corresponding pairwise alignment.

This means that multiple alignments typically contain more gaps than any given pair of aligned sequences

Page 7: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 8: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

How to optimize alignment algorithms?

Use structural information: reading frame protein structure

Sequence elements are not truly independent but related by phylogenic descent

Sequences often contain highly conserved regions

Page 9: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Optimize alignment algorithms

Sequences often contain highly conserved regions

These regions can be used for an initial alignment

By analyzing a number of small, independent fragments,

the algorithmic complexity can be drastically reduced!

Page 10: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Pairwise Alignment

The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.

Page 11: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

The big-O notation•One of the most important properties of an algorithm is how its execution time increases as the problem is made larger. By a larger problem, we mean more sequences to align, or longer sequences to align.

•This is the so-called algorithmic (or computational) complexity of the algorithm

•There is a notation to describe the algorithmic complexity, called the big-O notation.

•If we have a problem size (number of input data points) n, then an algorithm takes O(n) time if the time increases linearly with n.

•If the algorithm needs time proportional to the square of n, then it is O(n2)

Page 12: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

The big-O notation It is important to realize that an algorithm that

is quick on small problems may be totally useless on large problems if it has a bad O() behavior.

As a rule of thumb one can use the following characterizations, where n is the size of the problem, and c is a constant:

O(c) utopian O(log n) excellent O(n) very good

O(n2) not so good O(n3) pretty bad O(cn) disaster

Page 13: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

The big-O notation

• To compute a N-wise alignment, the algorithmic complexity is something like O(c2n), where c is a constant, and n is the number of sequences.

• This is a big-O disaster!

Page 14: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 15: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

The best solution is Dynamic Programming.

Page 16: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Multiple Sequence Alignment

In pairwise alignments, you have a two-dimensional matrix with the sequenceson each axis.

The number of operations required to locate the best “path” through the matrix is approximately proportional to the product of the lengths of the two sequences

A possible general method would be to extend the pairwise alignment method into a simultaneous N-wise alignment, using a complete dynamical-programming algorithm in N dimensions.

Algorithmically, this is not difficult to do

Page 17: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Dynamic Programming Dynamic Programming is a very general

programming technique. It is applicable when a large search space

can be structured into a succession of stages, such that: the initial stage contains trivial solutions to sub-

problems each partial solution in a later stage can be

calculated by recurring a fixed number of partial solutions in an earlier stage

the final stage contains the overall solution

Page 18: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Multiple Alignments

In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences using the same method is almost impossible.

The problem increases exponentially with the number of sequences involved(the product of the sequence lengths)

Page 19: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 20: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Optimal Alignment

For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.

Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.

Page 21: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Why we do multiple alignments? In order to characterize protein families,

identify shared regions of homology in a multiple sequence alignment; (this happens generally when a sequence search revealed homologies to several sequences).

Determination of the consensus sequence of several aligned sequences.

Consensus sequences can help to develop a sequence “finger print” which allows the identification of members of distantly related protein family (motifs)

MSA can help us to reveal biological facts about proteins, like analysis of the secondary/tertiary structure)

Page 22: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 23: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Why we do multiple alignments?

Crucial for genome sequencing: Random fragments of a large molecule are sequenced

and those that overlap are found by a multiple sequence alignment program.

There should be one correct alignment that corresponds to the genomic sequence rather than a range of possibilities

Sequence may be from one strand of DNA or the other, so complements of each sequence must also be compared

Sequence fragments will usually overlap, but by an unknown amount and in some cases, one sequence may be included within another

All of the overlapping pairs of sequence fragments must be assembled into large composite genome sequence

Page 24: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

An example of Multiple Alignment

VTISCTGSSSNIGAG-NHVKWYQQLPGQLPGVTISCTGTSSNIGS--ITVNWYQQLPGQLPGLRLSCSSSGFIFSS--YAMYWVRQAPGQAPGLSLTCTVSGTSFDD--YYSTWVRQPPGYYSTWVRQPPGPEVTCVVVDVSHEDPQVKFNWYVDG--ATLVCLISDFYPGA--VTVAWKADS--AALGCLVKDYFPEP--VTVSWNSG---VSLTCLVKGFYPSD--IAVEWWSNG--

Page 25: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Three Types of Algorithms

Progressive: ClustalW

Iterative: Muscle

Concistency Based: T-Coffee and Probcons

Page 26: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Progressive Multiple Alignment

The most practical and widely used method in multiple sequence alignment is the hierarchical extensions of pairwise alignment methods.

The principal is that multiple alignments is achieved by successive application of pairwise methods.

Page 27: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Choosing sequences for alignment

The more sequences to align the better. Don’t include similar (>80%) sequences. Sub-groups should be pre-aligned

separately, and one member of each subgroup should be included in the final multiple alignment.

Page 28: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Progressive Pairwise Methods

Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups.

This is an approximate or heuristic method!

Page 29: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 30: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Multiple Alignment Method Compare all sequences pairwise. Perform cluster analysis on the pairwise data to

generate a hierarchy for alignment. This may be in the form of a binary tree or a simple ordering

Build the multiple alignment by first aligning the most similar pair of sequences, then the next most similar pair and so on. Once an alignment of two sequences has been made, then this is fixed. Thus for a set of sequences A, B, C, D having aligned A with C and B with D the alignment of A, B, C, D is obtained by comparing the alignments of A and C with that of B and D using averaged scores at each aligned position.

Page 31: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Gap Penalties In the MSA scoring scheme, a penalty is subtracted

for each gap introduced into an alignment because the gap increases uncertainty into an alignment

The gap penalty is used to help decide whether or not to accept a gap or insertion in an alignment

Biologically, it should in general be easier for a sequence to accept a different residue in a position, rather than having parts of the sequence chopped away or inserted. Gaps/insertions should therefore be more rare than point mutations (substitutions)

In general, the lower the gapping penalties, the more gaps and more identities are detected but this should be considered in relation to biological significance

Most MSA programs allow for an adjustment of gap penalties

Page 32: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

The PILEUP Algorithm

First, PILEUP calculates approximate pairwise similarity scores between all sequences to be aligned, and they are clustered into a dendrogram (tree structure).

Then the most similar pairs of sequences are aligned.

Averages (similar to consensus sequences) are calculated for the aligned pairs.

New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram.

Page 33: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Choosing sequences for MSA

As far as possible, try to align sequences of similar length.

Pileup can align sequences of up to 5000 residues, with 2000 gaps (total 7000 characters).

Pileup is a good program only for similar (close) sequences.

Page 34: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

PileUp considerations PileUp does global multiple alignment, and

therefore is good for a group of similar sequences.

PileUp will fail to find the best local region of similarity (such as a shared motif) among distant related sequences.

PileUp always aligns all of the sequences you specified in the input file, even if they are not related.

The alignment can be degraded if some of the sequences are only distantly related.

Page 35: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

PILEUP Considerations

Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.

PILEUP parameters: 2 gap penalties (gap insert and gap extend) and an amino acid comparison matrix.

PILEUP will refuse to align sequences that require too many gaps or mismatches.

PILEUP will take quite a while to align more than about 10 sequences

Page 36: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

CLUSTAL CLUSTAL is a stand-alone (i.e. not integrated

into GCG) multiple alignment program that is superior in some respects to PILEUP

Works by progressive alignment: it aligns a pair of sequences then aligns the next one onto the first pair

Most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments

Uses alignment scores to produce a phylogenetic tree

Page 37: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

CLUSTAL Aligns the sequences sequentially,

guided by the phylogenetic relationships indicated by the tree

Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure

Is available with a great web interface: http://www.ebi.ac.uk/clustalw/

Also available in Biology Workbench

Page 38: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Multiple Alignment tools on the Web

There are a variety of multiple alignment tools available for free on the web.

CLUSTAL is available from a number of sites (with a variety of restrictions)

Other algorithms are available too

Page 39: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Muscle Algorithm: Using The Iteration

Page 40: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Consistency Based Algorithms: T-Coffee

Gotoh (1990) Iterative strategy using concistency

Martin Vingron (1991) Dot Matrices Multiplications Accurate but too stringeant

Dialign (1996, Morgenstern) Concistency Agglomerative Assembly

T-Coffee (2000, Notredame) Concistency Progressive algorithm

ProbCons (2004, Do) T-Coffee with a Bayesian Treatment

Page 41: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 42: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

T-Coffee and Consistency…

Page 43: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

T-Coffee and Consistency…

Page 44: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

T-Coffee and Consistency…

Page 45: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

T-Coffee and Consistency…

Page 46: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

T-Coffee and Consistency…

Page 47: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

APPROXIMATEFAST

ACCURATESLOW

Page 48: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Some URLs

EMBL-EBIhttp://www.ebi.ac.uk/clustalw/

BCM Search Launcher: Multiple Alignment

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)http://www.ibc.wustl.edu/service/msa/

Page 49: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Editing and displaying alignments

Sequence editors are used for: manual alignment/editing of sequences visualization of data data management import/export of data graphical enhancement of data for

presentations

Page 50: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Editing Multiple Alignments There are a variety of tools that can be

used to modify a multiple alignment. These programs can be very useful in

formatting and annotating an alignment for publication.

An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.

Page 51: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Displaying a multiple alignment

in GCG There are several programs to display the

multiple alignment prettily. The Pretty program prints sequences with

their columns aligned and can display a consensus for the alignment, allowing you to look at relationships among the sequences.

The PrettyBox program displays the alignment graphically with the conserved regions of the alignment as shaded boxes. The output is in Postscript format.

Page 52: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Example of PrettyBox Output

Page 53: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

GCG alignment editors

Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP.

Nicely shaded printouts can be produced with PRETTYBOX

GCG's SeqLab X-Windows interface has a superb multiple sequence editor - the best editor of any kind.

Page 54: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation
Page 55: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Other editors

The MACAW and SeqVu program for Macintosh and GeneDoc and DCSE for PCs are free and provide excellent editor functionality.

Many “comprehensive” molecular biology programs include multiple alignment functions: MacVector, OMIGA, Vector NTI, and

GeneTool/PepTool all include a built-in version of CLUSTAL

Page 56: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

SeqVu

Page 57: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

CINEMA CINEMA (Colour INteractive Editor

for Multiple Alignments) It is an editor created completely in

JAVA (old browsers beware) It includes a fully functional version of

CLUSTAL, BLAST, and a DotPlot module

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/

Page 58: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation

Informative Colors By default, the alignment is coloured crudely

according to residue type (proline and glycine have special structural properties, particularly in membrane proteins, so are grouped separately; similarly for cysteine, which is often involved in disulphide bond formation):

Polar positive H, K, R Blue Polar negative D, E Red Polar neutral S, T, N, Q Green Non-polar aliphatic A, V, L, I, M White Non-polar aromatic F, Y, W Purple P, G Brown C Yellow Special characters B, Z, X, - Grey

Page 59: Multiple Sequence Alignment. Terminology n Motif: the biological object one attempts to model - a functional or structural domain, active site, phosphorylation