chapter 5 multiple sequence alignment. multiple alignment is an extension of pairwise alignment...

Chapter 5 Multiple Sequence Alignment

Upload: brittany-rose

Post on 25-Dec-2015

•Multiple alignment is an extension of pairwise alignment where multiple sequences are aligned
•This alignment provides insights not possible in pairwise alignments, such as:
  •Conserved sequence patterns
  •Conserved and functionally critical amino acid residues
  •Prerequisite for phylogenetic analyses
  •Prediction of protein secondary and tertiary structures
  •Design of degenerate PCR primers

Scoring Function

•The purpose of multiple alignment is to line up the sequences so that the maximum number of residues from each sequence are matched according to a scoring function
•The scoring function is generally based on the “sum of pairs” (SP)
•The SP score is the sum of the scores of all residue pairs in each column, added up over all columns of the alignment

Sequence 1: G K N
Sequence 2: T R N
Sequence 3: S H E

Column 1     Column 2     Column 3
G:T = 1      K:R = 2      N:N = 6
T:S = 1      R:H = 0      N:E = 0
G:S = 0      K:H = -1     N:E = 0
Sum: 2       Sum: 1       Sum: 6

Total: 2 + 1 + 6 = 9

    C   S   T   P   A   G   N   D   E   Q   H   R   K   M   I   L   V   F   Y   W
C   9  -1  -1  -3   0  -3  -3  -3  -4  -3  -3  -3  -3  -1  -1  -1  -1  -2  -2  -2
S  -1   4   1  -1   1   0   1   0   0   0  -1  -1   0  -1  -2  -2  -2  -2  -2  -3
T  -1   1   4   1  -1   1   0   1   0   0   0  -1   0  -1  -2  -2  -2  -2  -2  -3
P  -3  -1   1   7  -1  -2  -1  -1  -1  -1  -2  -2  -1  -2  -3  -3  -2  -4  -3  -4
A   0   1  -1  -1   4   0  -1  -2  -1  -1  -2  -1  -1  -1  -1  -1  -2  -2  -2  -3
G  -3   0   1  -2   0   6  -2  -1  -2  -2  -2  -2  -2  -3  -4  -4   0  -3  -3  -2
N  -3   1   0  -2  -2   0   6   1   0   0  -1   0   0  -2  -3  -3  -3  -3  -2  -4
D  -3   0   1  -1  -2  -1   1   6   2   0  -1  -2  -1  -3  -3  -4  -3  -3  -3  -4
E  -4   0   0  -1  -1  -2   0   2   5   2   0   0   1  -2  -3  -3  -3  -3  -2  -3
Q  -3   0   0  -1  -1  -2   0   0   2   5   0   1   1   0  -3  -2  -2  -3  -1  -2
H  -3  -1   0  -2  -2  -2   1   1   0   0   8   0  -1  -2  -3  -3  -2  -1   2  -2
R  -3  -1  -1  -2  -1  -2   0  -2   0   1   0   5   2  -1  -3  -2  -3  -3  -2  -3
K  -3   0   0  -1  -1  -2   0  -1   1   1  -1   2   5  -1  -3  -2  -3  -3  -2  -3
M  -1  -1  -1  -2  -1  -3  -2  -3  -2   0  -2  -1  -1   5   1   2  -2   0  -1  -1
I  -1  -2  -2  -3  -1  -4  -3  -3  -3  -3  -3  -3  -3   1   4   2   1   0  -1  -3
L  -1  -2  -2  -3  -1  -4  -3  -4  -3  -2  -3  -2  -2   2   2   4   3   0  -1  -2
V  -1  -2  -2  -2   0  -3  -3  -3  -2  -2  -3  -3  -2   1   3   1   4  -1  -1  -3
F  -2  -2  -2  -4  -2  -3  -3  -3  -3  -3  -1  -3  -3   0   0   0  -1   6   3   1
Y  -2  -2  -2  -3  -2  -3  -2  -3  -2  -1   2  -2  -2  -1  -1  -1  -1   3   7   2
W  -2  -3  -3  -4  -3  -2  -4  -4  -3  -2  -2  -3  -3  -1  -3  -2  -3   1   2  11

Thus, the alignment (total score 9) is 2^9 = 512 times more likely than expected by random chance

BLOSUM62 substitution matrix
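
The sum-of-pairs calculation from the worked example can be sketched in a few lines of Python. This is only an illustration: the `PAIR_SCORES` table holds just the pair scores quoted on this slide (not a full substitution matrix), and the function names are hypothetical.

```python
from itertools import combinations

# Pair scores copied from the worked example on this slide.
# Only the pairs needed for the example are listed (an assumption
# made to keep the sketch short); frozenset makes lookups symmetric.
PAIR_SCORES = {
    frozenset("GT"): 1, frozenset("TS"): 1, frozenset("GS"): 0,
    frozenset("KR"): 2, frozenset("RH"): 0, frozenset("KH"): -1,
    frozenset("N"): 6,   # N:N (a pair of identical residues)
    frozenset("NE"): 0,
}


def pair_score(a, b):
    """Look up the score of one residue pair, in either order."""
    return PAIR_SCORES[frozenset((a, b))]


def column_score(column):
    """Sum the scores of all residue pairs within one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sum_of_pairs(alignment):
    """Sum the column scores over all columns of a gap-free alignment."""
    return sum(column_score(col) for col in zip(*alignment))


alignment = ["GKN", "TRN", "SHE"]
column_scores = [column_score(col) for col in zip(*alignment)]
total = sum_of_pairs(alignment)
print(column_scores, total)
```

For the three sequences above, the column scores come out as 2, 1 and 6, giving the total of 9 shown in the example.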

Exhaustive Algorithms

Brute Force Algorithm

•Similar to the dynamic programming approach used for pairwise alignment: it searches for the best solution by examining every possible solution
•Pairwise alignment uses a 2D matrix
•For N sequences, an N-dimensional matrix is used
•The number of calculations increases exponentially with the number of sequences (L×L×L×…, one factor per sequence of length L)
•Generally only useful for about 10 or fewer short sequences
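
To get a feel for this growth, the size of the full dynamic programming matrix can be counted directly. The sketch below assumes the usual layout with one extra row or column per sequence for the leading gap, so each dimension has length + 1 entries; the function name `dp_cells` is made up for illustration.

```python
def dp_cells(seq_lengths):
    """Count the cells of the full dynamic programming matrix.

    Each sequence contributes one dimension of size (length + 1),
    the extra entry being the position before the first residue.
    """
    cells = 1
    for length in seq_lengths:
        cells *= length + 1
    return cells


print(dp_cells([100, 100]))    # two sequences of 100 residues
print(dp_cells([100] * 10))    # ten sequences of 100 residues
```

Two sequences of 100 residues need about ten thousand cells, while ten such sequences need roughly 1.1 × 10^20, which is why the exhaustive approach is limited to a handful of short sequences.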

Divide and Conquer Alignment (DCA)

•Identify regional similarities among the multiple sequences
•Do a brute force alignment of the similar regions
•Join the independently aligned regions
•http://bibiserv.techfak.uni-bielefeld.de/dca/

Heuristic Algorithm

Progressive Alignment Method

•All pairs of sequences are aligned pairwise with Needleman-Wunsch
•The similarity scores of the aligned pairs are recorded
•The scores are entered into a matrix
•A guide tree is constructed that reflects the similarity between the aligned pairs
•The most closely related sequences are re-aligned with Needleman-Wunsch
•Different substitution matrices are selected depending on the evolutionary distance between the sequences to be aligned
•The aligned pair is converted to a “consensus sequence” with fixed gaps
•The consensus sequence is treated as an ordinary sequence in the next step, which is a pairwise alignment with the most closely related sequence in the guide tree
•The next “consensus sequence” is calculated and the process is repeated until all sequences are aligned
•Most famous: ClustalW (command line) and ClustalX (GUI)
•http://www.ebi.ac.uk/Tools/clustalw2/index.html
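
The first steps of the progressive method (score all pairs, store the scores, find the most closely related pair) can be sketched as follows. This is a simplified illustration, not how ClustalW itself is implemented: it assumes a plain match/mismatch score with a linear gap penalty instead of substitution matrices, and all function names and parameter values are hypothetical.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score (linear gap penalty).

    Only the score is computed, one matrix row at a time.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Score every pair of sequences; keys are (index, index) tuples."""
    return {
        (i, j): nw_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


def closest_pair(scores):
    """Return the pair with the highest similarity score."""
    return max(scores, key=scores.get)


seqs = ["GKNLV", "GKNLI", "SHEAW"]
scores = pairwise_scores(seqs)
first_pair = closest_pair(scores)
print(scores, first_pair)
```

In this toy input the first two sequences differ at a single position, so they form the pair that a guide tree would join first.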

Download and install ClustalW from

ftp://ftp.ebi.ac.uk/pub/software/clustalw2/2.0.9/

Spend a few minutes entering sequences and doing alignments

•ClustalW uses gap penalties that are context sensitive:
  •Gaps are penalized more heavily close to runs of hydrophobic amino acids (more likely to be in internal, conserved regions of a protein) than next to hydrophilic regions or glycine (G), which are likely to be on the outside in loops
•Weighting scheme: closely related sequences are given a lower weight
•The weight depends on the branch lengths in the guide tree, with the length of each shared branch divided by the number of sequences that share it
•This minimizes the dominating effect that a group of very similar sequences could otherwise have
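
The weighting idea can be illustrated with a small sketch. It assumes that a sequence's weight is the sum of the branch lengths on the path from the root to that sequence, with each shared branch divided by the number of sequences below it. The tree layout (nested dictionaries), the branch lengths and the function names are all hypothetical, and this is not ClustalW's own code.

```python
def leaf_count(node):
    """Number of sequences (leaves) below a node."""
    if "children" not in node:
        return 1
    return sum(leaf_count(child) for child in node["children"])


def sequence_weights(node, inherited=0.0, weights=None):
    """Walk the tree from the root, splitting each branch length
    evenly among the sequences that share that branch."""
    if weights is None:
        weights = {}
    share = inherited + node.get("length", 0.0) / leaf_count(node)
    if "children" in node:
        for child in node["children"]:
            sequence_weights(child, share, weights)
    else:
        weights[node["name"]] = share
    return weights


# Hypothetical guide tree: A and B are close relatives, C is distant.
tree = {"children": [
    {"length": 0.3, "children": [
        {"name": "A", "length": 0.1},
        {"name": "B", "length": 0.1},
    ]},
    {"name": "C", "length": 0.5},
]}
weights = sequence_weights(tree)
print(weights)
```

Here the two closely related sequences A and B each receive about half the weight of the distant sequence C, so together they count roughly as much as C alone.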

Drawbacks and Solutions

•Based on global alignment, so only sequences of similar length can be aligned well
•The long gaps required to align sequences of dissimilar length are penalized
•“Greedy” algorithm: once gaps are introduced, they stay in all subsequent consensus sequences

T-Coffee

•Tree-based Consistency Objective Function for alignment Evaluation
•http://www.ebi.ac.uk/Tools/t-coffee/
•http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi
•Performs global pairwise alignment with Clustal
•Performs local pairwise alignment with Lalign
•The global and the ten best local alignments are pooled to form a library
•Each pairwise alignment in the library is then checked against every possible third sequence
•A distance matrix is calculated to build a guide tree
•The guide tree is used for the final multiple alignment
•Does not get “stuck” in sub-optimal initial alignments
•Slower than Clustal

dbClustal

•First performs a BLASTP search for a query sequence
•The aligned pairs are analyzed to obtain anchor points (locally conserved regions) using a program called Ballast
•A global alignment is generated by Clustal, weighted toward the anchor points
•The initial local alignment minimizes errors in divergent sequences
•The multiple alignment is subsequently evaluated by NorMD, which removes poorly aligned sequences
•http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustal+noid

Partial Order Alignment (POA)

•http://bioinformatics.ucla.edu/poa/
•The multiple alignment is built up by adding more and more sequences from a list
•Identical aligned residues are condensed into single nodes of a graph
•Each new sequence is aligned with each sequence of the graph model
•Eliminates the problem of error fixation
•Faster and more accurate than Clustal

PRALINE

•http://zeus.cs.vu.nl/programs/pralinewww/
•Builds profiles of the sequences to be aligned
•Profiles are generated by PSI-BLAST
•Because the profiles contain information on close relatives, divergent sequences are aligned more accurately
•The program can incorporate protein secondary structure
•Very sophisticated but very slow

Iterative Alignment

PRRN

•Finds the optimal solution by iteratively modifying sub-optimal solutions
•http://prrn.ims.u-tokyo.ac.jp/
•A multiple alignment is performed on the whole group of sequences
•The sequences are randomly distributed into two groups
•Dynamic programming is applied to the consensus sequences derived from each group
•The random split is repeated and another round of dynamic programming alignment is performed
•This is repeated until the alignment score no longer increases
•A multiple alignment of the sequences is then performed again
•The process is repeated until the multiple alignment score no longer improves

Iterative Alignment

DIALIGN2

•http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=dialign
•Breaks all sequences down into segments and performs alignments between the segments
•High-scoring segments are progressively assembled into larger and larger blocks
•The score of an alignment is calculated from the blocks and not from individual residues
•Sequence regions between blocks are left unaligned
•Well suited to the alignment of divergent sequences

Practical Issues

•DNA alignments are based on only 4 nucleotides and are less reliable than protein sequence alignments
•Alignment of DNA sequences does not consider functional issues, such as gene boundaries
•Insertion of gaps may “break” codons or cause a frameshift that would not be tolerated in the protein and is functional nonsense
•Thus, it is always better to align protein sequences
•It is possible to convert DNA to an amino acid sequence, align, and then convert the alignment back to DNA

•RevTrans (http://www.cbs.dtu.dk/services/RevTrans/)
•PROTAL2DNA (missing link…)
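
The last step, turning an aligned protein sequence back into DNA, can be sketched as below. This is a minimal illustration and not the method used by the tools listed above: it assumes the protein was translated from exactly the given coding DNA (three nucleotides per residue, no stop codon), and the function name `thread_codons` is made up.

```python
def thread_codons(aligned_protein, dna, gap="-"):
    """Map an aligned protein sequence back onto its coding DNA.

    Every residue is replaced by its original codon and every gap
    by three gap characters, so codons are never broken.
    """
    residues = aligned_protein.replace(gap, "")
    if len(dna) != 3 * len(residues):
        raise ValueError("DNA length must be 3 x the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(
        gap * 3 if ch == gap else next(codons) for ch in aligned_protein
    )


codon_alignment = thread_codons("MK-L", "ATGAAACTG")
print(codon_alignment)
```

Because gaps are always inserted in multiples of three, the resulting DNA alignment keeps the reading frame intact.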

Editing and Format

•Most alignment programs require final editing by a human to ensure that the alignment makes functional sense
  •Finding badly aligned regions
  •Removing nonsensical gaps, etc.
•http://www.mbio.ncsu.edu/bioEdit/bioedit.html

•Sequences often need to be converted from one format to another: http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi/
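
As a small aid for the manual editing step, columns that consist mostly of gaps can be flagged for inspection. The sketch below is only one possible heuristic; the 50% threshold and the function name `gappy_columns` are assumptions, not part of any of the tools mentioned here.

```python
def gappy_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Return the indices of columns whose gap fraction exceeds
    max_gap_fraction; these are candidates for manual checking."""
    flagged = []
    for index, column in enumerate(zip(*alignment)):
        gap_fraction = column.count(gap) / len(column)
        if gap_fraction > max_gap_fraction:
            flagged.append(index)
    return flagged


alignment = ["MK-LV", "M--LV", "MKALI"]
suspect = gappy_columns(alignment)
print(suspect)
```

A flagged column is not necessarily wrong; it simply marks a place where a human editor may want to look more closely.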