sequence alignment and phylogenetic analysis
![Page 1: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/1.jpg)
Sequence Alignment and Phylogenetic Analysis
![Page 2: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/2.jpg)
Evolution
![Page 3: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/3.jpg)
Sequence Alignment
Unaligned sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Aligned:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition
Given two strings x = x1x2…xM and y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x and 0,…, N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
![Page 4: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/4.jpg)
What is a good alignment?
AGGCTAGTT, AGCGAAGTTT

AGGCTAGTT-
AGCGAAGTTT
6 matches, 3 mismatches, 1 gap

AGGCTA-GTT-
AG-CGAAGTTT
7 matches, 1 mismatch, 3 gaps

AGGC-TA-GTT-
AG-CG-AAGTTT
7 matches, 0 mismatches, 5 gaps
![Page 5: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/5.jpg)
Scoring Function
• Sequence edits, starting from AGGCCTC:
  • Mutations: AGGACTC
  • Insertions: AGGGCCTC
  • Deletions: AGG.CTC

Scoring Function:
Match: +m
Mismatch: -s
Gap: -d

Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d
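As a minimal sketch of the scoring function, the Python snippet below counts matches, mismatches, and gap columns in an already aligned pair and applies F. The function name `score_alignment` and the default values m = s = d = 1 are assumptions made for illustration only.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Return (matches, mismatches, gaps, F) for two aligned rows of equal length."""
    assert len(row_x) == len(row_y), "aligned rows must have the same length"
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1          # a letter facing a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps, matches * m - mismatches * s - gaps * d


# The three candidate alignments of AGGCTAGTT and AGCGAAGTTT
print(score_alignment("AGGCTAGTT-", "AGCGAAGTTT"))
print(score_alignment("AGGCTA-GTT-", "AG-CGAAGTTT"))
print(score_alignment("AGGC-TA-GTT-", "AG-CG-AAGTTT"))
```

Which alignment wins depends on the relative sizes of m, s, and d; with all three set to 1, the second alignment scores highest.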
![Page 6: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/6.jpg)
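Example
x = AGTA, y = ATA
m = 1 (match), s = 1 (mismatch penalty), d = 1 (gap penalty)

F(i, j), with i = 0…4 across x (columns) and j = 0…3 down y (rows):

| | | A | G | T | A |
|---|---|---|---|---|---|
| | 0 | -1 | -2 | -3 | -4 |
| A | -1 | 1 | 0 | -1 | -2 |
| T | -2 | 0 | 0 | 1 | 0 |
| A | -3 | -1 | -1 | 0 | 2 |

F(1, 1) = max{F(0, 0) + s(A, A), F(0, 1) - d, F(1, 0) - d} = max{0 + 1, -1 - 1, -1 - 1} = 1, where s(A, A) = +m is the score for aligning A with A.

Optimal alignment (score 2):
AGTA
A-TA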
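The fill and traceback for this example can be sketched in Python as follows. This is an illustrative implementation under the stated parameters (m = 1, s = 1, d = 1); the function names `needleman_wunsch` and `traceback` are assumptions, and ties during traceback are broken in favor of the diagonal move.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Fill the global alignment matrix; F[j][i] uses rows for y and columns for x."""
    M, N = len(x), len(y)
    F = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(1, M + 1):
        F[0][i] = -i * d                      # leading gaps in y
    for j in range(1, N + 1):
        F[j][0] = -j * d                      # leading gaps in x
    for j in range(1, N + 1):
        for i in range(1, M + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[j][i] = max(F[j - 1][i - 1] + sub,   # x_i aligned to y_j
                          F[j - 1][i] - d,         # y_j aligned to a gap
                          F[j][i - 1] - d)         # x_i aligned to a gap
    return F


def traceback(F, x, y, m=1, s=1, d=1):
    """Recover one optimal alignment by walking back from the bottom-right cell."""
    i, j = len(x), len(y)
    ax, ay = [], []
    while i > 0 or j > 0:
        sub = (m if x[i - 1] == y[j - 1] else -s) if i > 0 and j > 0 else None
        if sub is not None and F[j][i] == F[j - 1][i - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[j][i] == F[j][i - 1] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return "".join(reversed(ax)), "".join(reversed(ay))


F = needleman_wunsch("AGTA", "ATA")
for row in F:
    print(row)
print(traceback(F, "AGTA", "ATA"))
```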
![Page 7: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/7.jpg)
The Needleman-Wunsch Matrix
[Figure: dynamic programming matrix with x1 … xM along the top and y1 … yN down the side]
Every nondecreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences.
An optimal alignment is composed of optimal subalignments.
![Page 8: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/8.jpg)
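Example
x = HEAGAWGHEE (columns), y = PAWHEAE (rows). The initial row and column (0, -8, -16, …) correspond to a linear gap penalty of d = 8.

Matrix after initialization and the first row:

| | | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 | -9 | -17 | -25 | -33 | -41 | -49 | -57 | -65 | -73 |
| A | -16 | | | | | | | | | | |
| W | -24 | | | | | | | | | | |
| H | -32 | | | | | | | | | | |
| E | -40 | | | | | | | | | | |
| A | -48 | | | | | | | | | | |
| E | -56 | | | | | | | | | | |

Substitution scores used in the example:

| | A | E | G | H | W |
|---|---|---|---|---|---|
| A | 5 | -1 | 0 | -2 | -3 |
| E | -1 | 6 | -3 | 0 | -3 |
| H | -2 | 0 | -2 | 10 | -3 |
| P | -1 | -1 | -2 | -2 | -4 |
| W | -3 | -3 | -3 | -3 | 15 |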
![Page 9: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/9.jpg)
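Completed matrix:

| | | H | E | A | G | A | W | G | H | E | E |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | -8 | -16 | -24 | -32 | -40 | -48 | -56 | -64 | -72 | -80 |
| P | -8 | -2 | -9 | -17 | -25 | -33 | -41 | -49 | -57 | -65 | -73 |
| A | -16 | -10 | -3 | -4 | -12 | -20 | -28 | -36 | -44 | -52 | -60 |
| W | -24 | -18 | -11 | -6 | -7 | -15 | -5 | -13 | -21 | -29 | -37 |
| H | -32 | -14 | -18 | -13 | -8 | -9 | -13 | -7 | -3 | -11 | -19 |
| E | -40 | -22 | -8 | -16 | -16 | -9 | -12 | -15 | -7 | 3 | -5 |
| A | -48 | -30 | -16 | -3 | -11 | -11 | -12 | -12 | -15 | -5 | 2 |
| E | -56 | -38 | -24 | -11 | -6 | -12 | -14 | -15 | -12 | -9 | 1 |

Substitution scores used in the example:

| | A | E | G | H | W |
|---|---|---|---|---|---|
| A | 5 | -1 | 0 | -2 | -3 |
| E | -1 | 6 | -3 | 0 | -3 |
| H | -2 | 0 | -2 | 10 | -3 |
| P | -1 | -1 | -2 | -2 | -4 |
| W | -3 | -3 | -3 | -3 | 15 |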
![Page 10: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/10.jpg)
PAMX
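• PAMx = (PAM1)^x
• PAM250 = (PAM1)^250
• PAM250 is a widely used scoring matrix:

| | Ala (A) | Arg (R) | Asn (N) | Asp (D) | Cys (C) | Gln (Q) | Glu (E) | Gly (G) | His (H) | Ile (I) | Leu (L) | Lys (K) | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ala (A) | 13 | 6 | 9 | 9 | 5 | 8 | 9 | 12 | 6 | 8 | 6 | 7 | ... |
| Arg (R) | 3 | 17 | 4 | 3 | 2 | 5 | 3 | 2 | 6 | 3 | 2 | 9 | |
| Asn (N) | 4 | 4 | 6 | 7 | 2 | 5 | 6 | 4 | 6 | 3 | 2 | 5 | |
| Asp (D) | 5 | 4 | 8 | 11 | 1 | 7 | 10 | 5 | 6 | 3 | 2 | 5 | |
| Cys (C) | 2 | 1 | 1 | 1 | 52 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | |
| Gln (Q) | 3 | 5 | 5 | 6 | 1 | 10 | 7 | 3 | 7 | 2 | 3 | 5 | |
| ... | | | | | | | | | | | | | |
| Trp (W) | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | |
| Tyr (Y) | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | 3 | 2 | 2 | 1 | |
| Val (V) | 7 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 15 | 10 | |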
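The relation PAMx = (PAM1)^x is repeated matrix multiplication. As a rough illustration, the Python sketch below raises a hypothetical 2-state substitution probability matrix to a power. The matrix `toy_pam1` is made up for the example and is not real PAM data; `mat_mul` and `mat_pow` are likewise assumed helper names.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(p, x):
    """Return p multiplied by itself x times (x >= 0), starting from the identity."""
    n = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(x):
        result = mat_mul(result, p)
    return result


# Hypothetical one-step substitution probabilities (each row sums to 1)
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]

toy_pam2 = mat_pow(toy_pam1, 2)
toy_pam250 = mat_pow(toy_pam1, 250)
print(toy_pam2)
print(toy_pam250)
```

Each row of the powered matrix still sums to 1, since a product of row-stochastic matrices is row-stochastic.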
![Page 11: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/11.jpg)
The Blosum50 Scoring Matrix
![Page 12: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/12.jpg)
Affine Gap Penalties
• In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events:
Normal scoring would give the same score for both alignments
This is more likely.
This is less likely.
![Page 13: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/13.jpg)
Affine gaps
γ(n) = d + (n - 1)e is the penalty for a gap of length n, where d is the gap-open penalty and e is the gap-extend penalty.

To compute the optimal alignment, define:
F(i, j): score of alignment x1…xi to y1…yj if xi aligns to yj
G(i, j): score if yj aligns to a gap after xi
H(i, j): score if xi aligns to a gap after yj
V(i, j) = best score of alignment x1…xi to y1…yj
![Page 14: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/14.jpg)
Needleman-Wunsch with affine gaps
Initialization:
V(0, 0) = 0
V(i, 0) = -(d + (i - 1)e)
V(0, j) = -(d + (j - 1)e)

Iteration:
V(i, j) = max{ F(i, j), G(i, j), H(i, j) }
F(i, j) = V(i - 1, j - 1) + s(xi, yj)
G(i, j) = max{ V(i, j - 1) - d, G(i, j - 1) - e }
H(i, j) = max{ V(i - 1, j) - d, H(i - 1, j) - e }

Termination: as in the linear-gap case, the optimal score is V(M, N)
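A minimal Python sketch of these recurrences follows. It assumes a simple match/mismatch substitution score and illustrative penalties (d = 3, e = 1); the function name `affine_align_score` and all default values are assumptions, and only the score (no traceback) is computed.

```python
NEG_INF = float("-inf")


def affine_align_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with affine gap penalty d + (n - 1) * e."""
    M, N = len(x), len(y)
    V = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    G = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # y_j aligned to a gap
    H = [[NEG_INF] * (N + 1) for _ in range(M + 1)]  # x_i aligned to a gap
    V[0][0] = 0
    for i in range(1, M + 1):                # x_1..x_i against leading gaps
        H[i][0] = -(d + (i - 1) * e)
        V[i][0] = H[i][0]
    for j in range(1, N + 1):                # y_1..y_j against leading gaps
        G[0][j] = -(d + (j - 1) * e)
        V[0][j] = G[0][j]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F = V[i - 1][j - 1] + s                          # x_i aligned to y_j
            G[i][j] = max(V[i][j - 1] - d, G[i][j - 1] - e)  # open or extend
            H[i][j] = max(V[i - 1][j] - d, H[i - 1][j] - e)  # open or extend
            V[i][j] = max(F, G[i][j], H[i][j])
    return V[M][N]


# Six matches and one gap of length 3: 6 - (3 + 2 * 1)
print(affine_align_score("AAATTTGGG", "AAAGGG"))
```

With these penalties a single gap of length 3 costs 5, whereas three separate single-letter gaps would cost 9, which is the behavior affine penalties are meant to capture.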
![Page 15: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/15.jpg)
Pairwise Alignment Tools
![Page 16: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/16.jpg)
Some Typical Dot-plot Comparisons
• Divergent sequences where only a segment is homologous
• Long insertions and deletions
• Tandem repeats
  • The square shape of the pattern is characteristic of these repeats
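To make the idea of a dot plot concrete, here is a small Python sketch, assuming a fixed window size and an identity-count threshold. The function name `dot_plot` and its parameters are illustrative choices, not the interface of any particular tool.

```python
def dot_plot(a, b, window=3, threshold=3):
    """Return the set of (i, j) where the windows a[i:i+window] and
    b[j:j+window] agree in at least `threshold` positions."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            same = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if same >= threshold:
                dots.add((i, j))
    return dots


# Self-comparison of a short tandem repeat: besides the main diagonal,
# off-diagonal dots appear where one repeat unit matches the other.
print(sorted(dot_plot("ACGACG", "ACGACG")))
```

Lowering the threshold relative to the window size tolerates mismatches, at the cost of more background dots.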
![Page 17: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/17.jpg)
Using Dotlet
• Dotlet is one of the handiest tools for making dot plots
• Dotlet is a Java applet
• Open and download the applet at the following site: http://myhits.isb-sib.ch/cgi-bin/dotlet
• Use Firefox or IE
![Page 18: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/18.jpg)
Window size
Dot plot window
Alignment window
Threshold window for fine tuning
![Page 19: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/19.jpg)
![Page 20: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/20.jpg)
![Page 21: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/21.jpg)
Looking at Repeated Domains with Dotlet
• The square shape is typical of tandem repeats
• The repeats are not perfect because the sequences have diverged after their duplication
![Page 22: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/22.jpg)
Comparing a Gene and Its Product
• Eukaryotic genes are transcribed into RNA
• The RNA is then spliced to remove the introns’ sequences
• It may be necessary to compare the gene and its product
• Dotlet makes this comparative analysis easy
![Page 23: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/23.jpg)
Lalign and BLAST
• Lalign is like a very precise BLAST
• It works on only two sequences at a time
• You must provide both sequences
![Page 24: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/24.jpg)
LaLign
http://www.ch.embnet.org/software/LALIGN_form.html
![Page 25: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/25.jpg)
![Page 26: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/26.jpg)
Lalign Output
• Lalign produces an output similar to the alignment section of BLAST
• The E-value indicates the significance of each alignment
• Low E-value → good alignment
![Page 27: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/27.jpg)
Multiple Alignment
![Page 28: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/28.jpg)
Example
![Page 29: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/29.jpg)
4 Ways of Using MSAs . . .
![Page 30: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/30.jpg)
4 More Ways of Using MSAs
![Page 31: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/31.jpg)
Generalizing the Notion of Pairwise Alignment
• Alignment of 2 sequences is represented as a 2-row matrix
• In a similar way, we represent alignment of 3 sequences as a 3-row matrix

A T _ G C G _
A _ C G T _ A
A T C A C _ A

• Score: more conserved columns, better alignment
![Page 32: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/32.jpg)
Aligning Three Sequences
• Same strategy as aligning two sequences
• Use a 3-D grid, with each axis representing a sequence to align
• For global alignments, go from source to sink
![Page 33: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/33.jpg)
Architecture of 3-D Alignment Cell
[Figure: cube with corners (i-1,j-1,k-1), (i,j-1,k-1), (i-1,j,k-1), (i,j,k-1), (i-1,j-1,k), (i,j-1,k), (i-1,j,k), and (i,j,k)]
![Page 34: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/34.jpg)
Multiple Alignment: Dynamic Programming
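• s(i, j, k) = max of:
  • s(i-1, j-1, k-1) + δ(vi, wj, uk) (cube diagonal: no indels)
  • s(i-1, j-1, k) + δ(vi, wj, _) (face diagonal: one indel)
  • s(i-1, j, k-1) + δ(vi, _, uk) (face diagonal: one indel)
  • s(i, j-1, k-1) + δ(_, wj, uk) (face diagonal: one indel)
  • s(i-1, j, k) + δ(vi, _, _) (edge diagonal: two indels)
  • s(i, j-1, k) + δ(_, wj, _) (edge diagonal: two indels)
  • s(i, j, k-1) + δ(_, _, uk) (edge diagonal: two indels)
• δ(x, y, z) is an entry in the 3-D scoring matrix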
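The seven-way recurrence can be sketched in Python as below. The column score `all_equal` (1 when all three letters agree, 0 otherwise) is a deliberately simple stand-in for δ chosen for this example, and the function name `align3_score` is likewise an assumption; with this δ the result is the length of the longest subsequence common to all three strings.

```python
from itertools import product


def all_equal(a, b, c):
    """Toy column score: 1 if the three symbols are the same letter, else 0."""
    return 1 if a == b == c and a != "_" else 0


def align3_score(v, w, u, delta=all_equal):
    """Best score of a global alignment of three sequences."""
    # The 7 predecessor moves: every 0/1 step vector except (0, 0, 0)
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    s = {(0, 0, 0): 0}
    for i in range(len(v) + 1):
        for j in range(len(w) + 1):
            for k in range(len(u) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = None
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # predecessor lies outside the grid
                    column = (v[i - 1] if di else "_",
                              w[j - 1] if dj else "_",
                              u[k - 1] if dk else "_")
                    cand = s[pi, pj, pk] + delta(*column)
                    if best is None or cand > best:
                        best = cand
                s[i, j, k] = best
    return s[len(v), len(w), len(u)]


print(align3_score("ATGC", "AGC", "ATC"))
```

The three nested loops over cells and the seven moves per cell correspond to the 7n^3 operation count for three sequences of length n.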
![Page 35: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/35.jpg)
Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is 7n^3; O(n^3)
• For k sequences, build a k-dimensional Manhattan, with run time (2^k - 1)(n^k); O(2^k n^k)
• Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time
![Page 36: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/36.jpg)
Sum of Pairs Score(SP-Score)
• Consider pairwise alignment of sequences ai and aj imposed by a multiple alignment of k sequences
• Denote the score of this suboptimal (not necessarily optimal) pairwise alignment as s*(ai, aj)
• Sum up the pairwise scores for a multiple alignment, counting each pair once:

s(a1,…,ak) = Σ_{i<j} s*(ai, aj)
![Page 37: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/37.jpg)
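SP-Score: Example

a1 … ak:
ATG-C-AAT
A-G-CATAT
ATCCCATTT

S(a1…ak) = Σ_{i<j} S*(ai, aj), summed over the (n choose 2) pairs of sequences

To calculate each column:
• Column 1 (A, A, A): three matching pairs, each scoring 1, so Score = 3
• Column 3 (G, G, C): one matching pair (G, G) scoring 1 and two mismatched (G, C) pairs, so Score = 1 - (two mismatch penalties)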
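A short Python sketch of the column-by-column SP computation follows. The pair scores (match +1, mismatch penalty 1, gap penalty 1, gap against gap 0) and the function names are assumptions made for illustration; the source example does not fix these values.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=1, gap=1):
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0              # two gaps facing each other contribute nothing
    if a == "-" or b == "-":
        return -gap
    return match if a == b else -mismatch


def column_sp(column):
    """Sum of scores over all unordered pairs of symbols in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """SP-score of a multiple alignment given as equal-length rows."""
    return sum(column_sp(col) for col in zip(*rows))


msa = ["ATG-C-AAT",
       "A-G-CATAT",
       "ATCCCATTT"]
print(column_sp("AAA"), column_sp("GGC"), sp_score(msa))
```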
![Page 38: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/38.jpg)
Multiple Alignment Induces Pairwise Alignments
Every multiple alignment induces pairwise alignments
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C
y: ACGC-GAG

x: AC-GCGG-C
z: GCCGC-GAG

y: AC-GCGAG
z: GCCGCGAG
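The induced pairwise alignment is obtained by keeping two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal Python sketch (the function name `induced_pairwise` is an assumption):

```python
def induced_pairwise(row_a, row_b):
    """Project a multiple alignment onto two rows, removing gap-only columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```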
![Page 39: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/39.jpg)
Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C
y: ACGC--GAC

x: AC-GCTGG-C
z: GCCGCA-GAG

y: AC-GC-GAG
z: GCCGCAGAG

can we construct a multiple alignment that induces them?
![Page 40: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/40.jpg)
Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments
Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C
y: ACGC--GAC

x: AC-GCTGG-C
z: GCCGCA-GAG

y: AC-GC-GAG
z: GCCGCAGAG

can we construct a multiple alignment that induces them? NOT ALWAYS

Pairwise alignments may be inconsistent
![Page 41: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/41.jpg)
Profile Representation of Multiple Alignment
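Alignment:
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G

Profile (frequency of each symbol per column; blank = 0):

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | | 1 | | | | | 1 | | | .8 | | | | |
| C | .6 | | | | 1 | | | .4 | 1 | | .6 | .2 | | |
| G | | | 1 | .2 | | | | | | .2 | | | .4 | 1 |
| T | .2 | | | | | 1 | | .6 | | | | | .2 | |
| - | .2 | | | .8 | | | | | | | .4 | .8 | .4 | |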
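The profile is just the per-column symbol frequency of the alignment. A minimal Python sketch, with `profile` as an assumed function name and each column represented as a dictionary that omits zero entries:

```python
from collections import Counter


def profile(rows):
    """Return one {symbol: frequency} dictionary per alignment column."""
    n = len(rows)
    return [{sym: count / n for sym, count in Counter(col).items()}
            for col in zip(*rows)]


rows = ["-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG"]

p = profile(rows)
for position, column in enumerate(p, start=1):
    print(position, column)
```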
![Page 42: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/42.jpg)
Multiple Alignment: Greedy Approach
• Choose the most similar pair of strings and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat
• This is a heuristic greedy method

k sequences:
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG

k-1 sequences/profiles:
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…
![Page 43: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/43.jpg)
Greedy Approach: Example
• Consider these 4 sequences
s1 GATTCA
s2 GTCTGA
s3 GATATT
s4 GTCAGC
![Page 44: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/44.jpg)
Greedy Approach: Example (cont’d)
• There are (4 choose 2) = 6 possible alignments

s2 GTCTGA
s4 GTCAGC (score = 2)

s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s1 GATTCA--
s4 G-T-CAGC (score = 0)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
![Page 45: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/45.jpg)
Greedy Approach: Example (cont’d)
s2 and s4 are closest; combine:

s2 GTCTGA
s4 GTCAGC
s2,4 GTCt/aGa/c (profile)

new set of 3 sequences:

s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
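The first greedy step, scoring every pair and picking the closest one, can be sketched in Python as follows. The scoring (match +1, mismatch -1, gap -1) is inferred from the pairwise scores listed in the example, and the names `nw_score`, `scores`, and `best_pair` are assumptions for this illustration; merging the chosen pair into a profile is not shown.

```python
from itertools import combinations


def nw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score, keeping only one previous row."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align x_i with y_j
                           prev[j] + gap,       # x_i against a gap
                           cur[j - 1] + gap))   # y_j against a gap
        prev = cur
    return prev[-1]


seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}

# Score all (4 choose 2) = 6 pairs and pick the highest-scoring one
scores = {(a, b): nw_score(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)}
best_pair = max(scores, key=scores.get)
print(scores)
print(best_pair)
```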
![Page 46: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/46.jpg)
Progressive Alignment
• Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.
• Progressive alignment works well for close sequences, but deteriorates for distant sequences
  • Gaps in consensus string are permanent
  • Use profiles to compare sequences
![Page 47: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/47.jpg)
ClustalW
• Popular multiple alignment tool today
• ‘W’ stands for ‘weighted’ (different parts of alignment are weighted differently).
• Three-step process
  1.) Construct pairwise alignments
  2.) Build Guide Tree
  3.) Progressive Alignment guided by the tree
![Page 48: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/48.jpg)
Step 1: Pairwise Alignment
![Page 49: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/49.jpg)
Step 2: Guide Tree
• Create Guide Tree using the similarity matrix
• ClustalW uses the neighbor-joining method
• Guide tree roughly reflects evolutionary relations
![Page 50: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/50.jpg)
![Page 51: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/51.jpg)
Step 3: Progressive Alignment
• Start by aligning the two most similar sequences
• Following the guide tree, add in the next sequences, aligning to the existing alignment
• Insert gaps as necessary
![Page 52: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/52.jpg)
![Page 53: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/53.jpg)
Multiple Alignment: History
1975 Sankoff: Formulated multiple alignment problem and gave dynamic programming solution
1988 Carrillo-Lipman: Branch and Bound approach for MSA
1990 Feng-Doolittle: Progressive alignment
1994 Thompson-Higgins-Gibson, ClustalW: Most popular multiple alignment program
1998 Morgenstern et al., DIALIGN: Segment-based multiple alignment
2000 Notredame-Higgins-Heringa, T-coffee: Using the library of pairwise alignments
2004 MUSCLE
![Page 54: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/54.jpg)
Practice of MSA
![Page 55: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/55.jpg)
Choosing the Right Sequences
• When building an alignment, it is your job to select the sequences
• Two main factors when selecting sequences:
  • Number of sequences
  • Nature of the sequences
• A reasonable number of sequences: 20 to 50
  • Ideal for most methods
  • Small alignments are easy to display and analyze
• Types of sequences
  • Well-selected sequences → informative alignment
![Page 56: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/56.jpg)
Some Guidelines for Choosing the Right Sequences
![Page 57: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/57.jpg)
DNA or Proteins?
• DNA sequences are harder to align than proteins
  • DNA-comparison models are less sophisticated
• Most methods work for both DNA and proteins
  • The results are less useful for DNA
• If your DNA is coding, work on the translated proteins
• If sequences are homologous . . .
  • Along their entire length → use progressive alignment methods (next slide)
  • In terms of local similarity → use motif-discovery methods (end of chapter)
![Page 58: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/58.jpg)
Choosing Sequences That Are Different Enough
• An alignment is useful if . . .
  • The sequences are correctly aligned
  • It can be used to produce trees, profiles, and structure predictions
• To obtain this result, the sequences must be
  • Not too similar
  • Not too different
• Sequences that are very similar . . .
  • Are easy to align correctly
  • Are not informative → useless trees and profiles, bad predictions
• Sequences that are very different . . .
  • Are difficult to align
  • Are very informative → good trees and profiles, good predictions
![Page 59: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/59.jpg)
Steps
• Gather the right sequences
• Compute MSA using servers/local programs
• Evaluate the results visually
• If it is hard to interpret
  • Closer examination, remove trouble makers
  • Redo and trim if needed
![Page 60: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/60.jpg)
Gathering Sequences with BLAST
• The most convenient way to select your sequences is to use a BLAST server
• Some BLAST servers are integrated with multiple-alignment methods:
  • www.expasy.ch (protein only)
  • srs.ebi.ac.uk (DNA/protein)
  • npsa-pbil.ibcp.fr
![Page 61: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/61.jpg)
Gathering Sequences with BLAST
• Select some of the top sequences
• Evenly select some sequences down to the bottom
• The idea is to have many intermediate sequences
![Page 62: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/62.jpg)
ExPASy
• www.expasy.ch/tools/blast
![Page 63: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/63.jpg)
![Page 64: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/64.jpg)
![Page 65: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/65.jpg)
![Page 66: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/66.jpg)
![Page 67: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/67.jpg)
![Page 68: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/68.jpg)
>sp|P20472|PRVA_HUMAN Parvalbumin alpha OS=Homo sapiens GN=PVALB PE=1 SV=2
MSMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIE
EDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
>sp|P80079|PRVA_FELCA Parvalbumin alpha OS=Felis catus GN=PVALB PE=1 SV=2
MSMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIE
EDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS
>sp|P02627|PRVA_RANES Parvalbumin alpha OS=Rana esculenta PE=1 SV=1
PMTDLLAAGDISKAVSAFAAPESFNHKKFFELCGLKSKSKEIMQKVFHVLDQDQSGFIEK
EELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKIGVDEFVTLVSES
>sp|P02626|PRVA_AMPME Parvalbumin alpha OS=Amphiuma means PE=1 SV=1
SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE
EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES
>sp|P02619|PRVB_ESOLU Parvalbumin beta OS=Esox lucius PE=1 SV=1
SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED
ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA
>sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 OS=Gallus gallus PE=1 SV=2
MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE
EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS
>sp|Q91482|PRVB1_SALSA Parvalbumin beta 1 OS=Salmo salar PE=1 SV=1
MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ
>sp|P02620|PRVB_MERME Parvalbumin beta OS=Merluccius merluccius PE=1 SV=1
AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE
DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG
>sp|P02622|PRVB_GADCA Parvalbumin beta OS=Gadus callarias PE=1 SV=1
AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE
DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG
![Page 69: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/69.jpg)
If You Know the Protein Sequences
• www.expasy.ch/sprot/sprot-retrieve-list.html
![Page 70: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/70.jpg)
Aligning Your Sequences
• Aligning sequences correctly is very difficult
  • It’s hard to align protein sequences with less than 25% identity (70% identity for DNA)
• All methods are approximate
• Alignment methods use the progressive algorithm
  • Compares the sequences two by two
  • Builds a guide tree
  • Aligns the sequences in the order indicated by the tree
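Percent identity between two aligned sequences can be estimated with a few lines of Python. Definitions of percent identity differ in what they use as the denominator; the sketch below assumes identity is measured over columns where both sequences have a residue, and the function name `percent_identity` is an illustrative choice.

```python
def percent_identity(row_a, row_b):
    """Percentage of identical residues among columns without a gap."""
    assert len(row_a) == len(row_b), "aligned rows must have the same length"
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not aligned:
        return 0.0
    identical = sum(1 for a, b in aligned if a == b)
    return 100.0 * identical / len(aligned)


# 5 residue-to-residue columns, 4 of them identical
print(percent_identity("GAT-TCA", "G-TCTGA"))
```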
![Page 71: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/71.jpg)
Selecting a Method
• Many alternative methods exist for MSAs
• Most of them use the progressive algorithm
  • They all are approximate methods
  • None is guaranteed to deliver the best alignments
• All existing methods have pros and cons
  • ClustalW is the most popular (21,000 citations)
  • T-Coffee and ProbCons are more accurate but slower
  • MUSCLE is very fast, ideal for very large datasets
![Page 72: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/72.jpg)
Selecting a Method (cont’d.)
• It’s impossible to guess in advance which method will do best.
• Accuracy is merely an average estimation
  • Methods are tested on reference datasets
  • Their accuracy is the average accuracy obtained on the reference datasets
• The most accurate method can always be outperformed by a less accurate method on a given dataset.
• An alternative: Use consensus methods such as MCOFFEE
![Page 73: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/73.jpg)
ClustalW
• www.ebi.ac.uk/clustalw
• pir.georgetown.edu/pirwww/search/multialn.shtml
• www.ddbj.nig.ac.jp/search/clustalw-e.html
![Page 74: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/74.jpg)
![Page 75: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/75.jpg)
![Page 76: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/76.jpg)
![Page 77: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/77.jpg)
![Page 78: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/78.jpg)
Tcoffee
• TCOFFEE: www.tcoffee.org
• CORE: evaluate MSA
• MCOFFEE: run many and combine
• EXPRESSO: with structural information
![Page 79: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/79.jpg)
![Page 80: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/80.jpg)
![Page 81: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/81.jpg)
![Page 82: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/82.jpg)
Running Many Methods at Once
• MCOFFEE is a meta-method
  • It runs all the individual MSA methods
  • It gathers all the produced MSAs
  • It combines the MSAs into a single MSA
• MCOFFEE is, on average, more accurate than any individual method
• Its color output lets you estimate the reliability of your MSA
• MCOFFEE is available on www.tcoffee.org
![Page 83: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/83.jpg)
MCOFFEE Color Output
• Red and orange residues are probably well aligned
• Yellow should be treated with caution
• Green and blue are probably incorrectly aligned
![Page 84: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/84.jpg)
MCOFFEE
![Page 85: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/85.jpg)
TCOFFEE
![Page 86: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/86.jpg)
TCOFFEE Results
![Page 87: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/87.jpg)
Interpreting Your MSA
• Don’t put blind trust in the output of the servers
  • Specialists always edit their MSAs by hand
• You must always estimate the biological accuracy of your MSA
  • Use the color code of Tcoffee
  • Use the conservation patterns of ClustalW:
    • ‘*’ Completely conserved position
    • ‘:’ Highly conserved position
    • ‘.’ Conserved position
  • Use experimental knowledge of your proteins
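The ‘*’ marker is the easiest of the three conservation symbols to reproduce by hand: a column gets it when every sequence has the same residue. The Python sketch below marks only such columns; the ‘:’ and ‘.’ symbols depend on residue-similarity groups that are not given here, so they are left out. The function name `fully_conserved_marks` is an assumption.

```python
def fully_conserved_marks(rows):
    """Return a string with '*' under completely conserved columns, ' ' elsewhere."""
    marks = []
    for col in zip(*rows):
        conserved = col[0] != "-" and all(c == col[0] for c in col)
        marks.append("*" if conserved else " ")
    return "".join(marks)


rows = ["-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG"]
for row in rows:
    print(row)
print(fully_conserved_marks(rows))
```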
![Page 88: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/88.jpg)
Understanding Conserved Positions
![Page 89: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/89.jpg)
Finding Information from Alignment
• Conserved regions
• Insert/delete
• Phylogenetic Reconstruction
• Motif
• …
![Page 90: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/90.jpg)
>sp|P02586|TNNC2_RABIT Troponin C, skeletal muscle OS=Oryctolagus cuniculus GN=TNNC2 PE=1 SV=2
MTDQQAEARSYLSEEMIAEFKAAFDMFDADGGGDISVKELGTVMRMLGQTPTKEELDAII
EEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEIF
RASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
>sp|P20472|PRVA_HUMAN Parvalbumin alpha OS=Homo sapiens GN=PVALB PE=1 SV=2
MSMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIE
EDELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
>sp|P80079|PRVA_FELCA Parvalbumin alpha OS=Felis catus GN=PVALB PE=1 SV=2
MSMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIE
EDELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS
>sp|P02627|PRVA_RANES Parvalbumin alpha OS=Rana esculenta PE=1 SV=1
PMTDLLAAGDISKAVSAFAAPESFNHKKFFELCGLKSKSKEIMQKVFHVLDQDQSGFIEK
EELCLILKGFTPEGRSLSDKETTALLAAGDKDGDGKIGVDEFVTLVSES
>sp|P02626|PRVA_AMPME Parvalbumin alpha OS=Amphiuma means PE=1 SV=1
SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE
EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES
>sp|P02619|PRVB_ESOLU Parvalbumin beta OS=Esox lucius PE=1 SV=1
SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED
ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA
>sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 OS=Gallus gallus PE=1 SV=2
MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE
EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS
>sp|Q91482|PRVB1_SALSA Parvalbumin beta 1 OS=Salmo salar PE=1 SV=1
MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ
>sp|P02620|PRVB_MERME Parvalbumin beta OS=Merluccius merluccius PE=1 SV=1
AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE
DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG
>sp|P02622|PRVB_GADCA Parvalbumin beta OS=Gadus callarias PE=1 SV=1
AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE
DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG
![Page 91: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/91.jpg)
![Page 92: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/92.jpg)
When Sequences Are Hard to Align
• Most MSA programs assume your sequences are related along their whole length
• When this assumption is not true, the progressive approach will not work
• The only alternative is to compare multiple sequences locally
![Page 93: Sequence Alignment and Phylogenetic Analysis](https://reader030.vdocuments.mx/reader030/viewer/2022033100/56814d84550346895dbae23c/html5/thumbnails/93.jpg)
Local Multiple-Comparison Methods
• Gibbs Sampler
  • Will make a local multiple alignment
  • Will ignore unrelated segments of your sequences
  • Ideal for finding DNA patterns such as promoters
• Motif discovery methods
  • Will look for motifs conserved in your sequences
  • The sequences do not need to be aligned
• The most popular motif-discovery methods:
  • TEIRESIAS, MEME, SMILE, PRATT