Post on 19-Dec-2015
Multiple Sequence Comparison
/course/eleg667-01-f/Topic-2c 2
Outline
Motivation
Multiple Sequence Alignment using Dynamic Programming
Multiple Sequence Alignment using Heuristics
  Star Alignments
  Tree Alignments (CLUSTAL W)
PSI-BLAST and multiple sequence alignment
Evaluation of Alignment Methods
Summary
Pair-wise sequence comparison
“In biomolecular sequences (DNA, RNA, Protein), high sequence similarity usually implies significant functional or structural similarity.”
Underlies the effectiveness of pair-wise sequence comparison and of biological database searching
Find sequences that have common sub-patterns but may not have been known to be biologically related.
Multiple sequence comparison
“Evolutionarily and functionally related molecular sequences can differ significantly at the sequence level and yet preserve similar function and/or structure.”
Underlies the effectiveness of multiple sequence comparison.
Deduce unknown conserved patterns from a set of sequences already known to be biologically related.
Common MSA Applications
Characterization and representation of protein families and later identification of other potential members of the family;
Identification and representation of conserved sequence features that correlate with structure and function;
Deduction of evolutionary history.
Common MSA Applications (cont.)
To detect/demonstrate homology between new sequences and existing families of sequences
To help predict the secondary and tertiary structures of new sequences
To suggest oligonucleotide primers for PCR
Comparing Multiple Sequences
We can compare multiple sequences by aligning the sequences and assigning a score to the alignments
Multiple Sequence Alignment (MSA)
Application of MSA
[Figure: Database -> Homology Search (e.g. BLAST) -> top scoring hits -> MSA -> conserved regions, evolution paths, ...]
Definition of MSA
A Multiple Sequence Alignment is obtained by inserting into each sequence a (possibly zero) number of gaps so that the resulting sequences are of the same length and each column has at least one character different from ‘-’ (gap).
Input sequences:
IMAGINABLE
IMPRACTICABLE
INFALLIBLE
A valid alignment:
IM--AG-INABLE
IMPRACTICABLE
IN-FALLI--BLE
Not a valid alignment (the third column contains only gaps):
IM---AG-INABLE
IM-PRACTICABLE
IN--FALLI--BLE
How to score an alignment?
The Sum-of-Pairs (SP) score:
A multiple alignment implies a pair-wise alignment for each pair of sequences.
SP defines the score of a multiple alignment as the sum of the scores of all implied pair-wise alignments.
Example:
A A C G T A C G A T A
A - C G T A - A A T G
G T C G T A - - T T A
Scoring: match = 1, mismatch = 0, gap-character = -1, gap-gap = 0
Pair-wise scores: 5 (rows 1 and 2), 4 (rows 1 and 3), 3 (rows 2 and 3); SP score = 5 + 4 + 3 = 12
Column scores: 1 -2 3 3 3 3 -2 -2 1 3 1, sum = 12
Note: score(-,-) = 0
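The SP computation above can be checked with a few lines of Python. This is a minimal sketch using the slide's scoring values; the function names `pair_score`, `sp_score` and `column_scores` are illustrative and not part of the original material.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one aligned pair of characters; gap-gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum the scores of all implied pair-wise alignments."""
    return sum(
        pair_score(a, b)
        for r1, r2 in combinations(alignment, 2)
        for a, b in zip(r1, r2)
    )

def column_scores(alignment):
    """Same total, accumulated column by column instead of pair by pair."""
    return [
        sum(pair_score(a, b) for a, b in combinations(col, 2))
        for col in zip(*alignment)
    ]

rows = ["AACGTACGATA", "A-CGTA-AATG", "GTCGTA--TTA"]
print(sp_score(rows))        # total SP score
print(column_scores(rows))   # per-column contributions
```

Summing by pairs and summing by columns give the same total, which is why the slide can show both decompositions.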
MSA using dynamic programming
[Figure: three-dimensional dynamic-programming lattice for three sequences; each cell is computed from 7 neighbouring cells]
7 calculations/cell (for k = 3)
If k sequences of size n then:
O(n^k) space and
O(k^2 2^k n^k) time
n^k cells, 2^k - 1 calculations/cell, k(k-1)/2 calculations to compute the SP-score
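The cost terms above can be seen directly in a small implementation. The following is a sketch, not an optimized tool: it assumes the SP scoring values from the earlier slide (match 1, mismatch 0, gap-character -1, gap-gap 0), and the name `msa_dp_score` is illustrative. It visits every cell of the k-dimensional lattice and tries all 2^k - 1 predecessor moves, so it is only practical for a handful of very short sequences.

```python
from itertools import combinations, product

def pair_score(a, b):
    # Scoring assumed from the SP slide: match 1, mismatch 0,
    # gap-character -1, gap-gap 0.
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0

def column_score(col):
    # k(k-1)/2 pair comparisons per column.
    return sum(pair_score(a, b) for a, b in combinations(col, 2))

def msa_dp_score(seqs):
    """Optimal SP score of a global alignment of all sequences."""
    k = len(seqs)
    origin = (0,) * k
    # Each move advances a non-empty subset of the sequences: 2^k - 1 moves.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    best = {origin: 0}
    # product() yields cells in lexicographic order, so every
    # predecessor of a cell is filled in before the cell itself.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        candidates = []
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue
            col = [seqs[i][cell[i] - 1] if m[i] else "-" for i in range(k)]
            candidates.append(best[prev] + column_score(col))
        best[cell] = max(candidates)
    return best[tuple(len(s) for s in seqs)]

print(msa_dp_score(["GA", "AG"]))        # two sequences: ordinary pair-wise DP
print(msa_dp_score(["AG", "AG", "AG"]))  # three identical sequences
```

The dictionary `best` holds one entry per lattice cell, which is the O(n^k) space term; the inner loops account for the 2^k - 1 moves and the k(k-1)/2 pair comparisons per move.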
Recall the pair-wise case
Question: when the DP comparison ends, how many possible distinct paths have been explored in total for this example?
Answer: Let us count. Total = 13
DP matrix:
      -   G   A
  -   0  -1  -2
  A  -1   0   0
  G  -2   0   0
Cell numbering:
  1 2 4
  3 5 7
  6 8 9
Question: from 1 to 9 how many paths?
[Figure: tree enumerating the 13 paths from cell 1 to cell 9]
Assume we have 3 sequences: AG, AC, GC
How to do DP?
Question: when the DP comparison ends, how many possible distinct paths have been explored in total?
Answer: Count!
[Figure: aligning multiple sequences; the three sequences label the axes of a three-dimensional DP lattice]
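The counting questions above can be answered mechanically. The sketch below counts lattice paths from the origin to the far corner when each step advances any non-empty subset of the sequences; it assumes the three sequences each have length 2, like the pair-wise example. The function name `count_paths` is illustrative.

```python
from functools import lru_cache
from itertools import product

def count_paths(lengths):
    """Number of distinct DP paths through a lattice with the given side lengths."""
    k = len(lengths)
    # 2^k - 1 possible moves: advance any non-empty subset of sequences.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def paths(cell):
        if not any(cell):          # the origin has exactly one (empty) path
            return 1
        total = 0
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) >= 0:
                total += paths(prev)
        return total

    return paths(tuple(lengths))

print(count_paths((2, 2)))     # pair-wise case: two sequences of length 2
print(count_paths((2, 2, 2)))  # three sequences of length 2
```

The pair-wise call reproduces the total of 13 given on the slide; under the length-2 assumption the three-sequence lattice already has 409 paths, which illustrates how quickly the search space grows with k.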
MSA Using DP with Heuristics
How can we cut down the search space (number of calculations) at each step?
One way is to eliminate pairwise projections that do not contribute to the optimal alignment; this requires developing a test for such projections.
Other MSA Methods Using Heuristics
Star Alignment: Build a multiple alignment based upon the pair-wise alignments between a fixed sequence, called the “center” of the input set, and all others.
Tree Alignment: Build a multiple alignment based upon the pair-wise alignments along edges of a tree relating all the sequences.
Star Alignment
Given k sequences:
Pick one of the sequences as the center.
Find optimal pair-wise alignments between the center sequence and each other sequence.
Aggregate the pair-wise alignments (progressive alignment).
Aggregate Step
Use the center Sc as a guide.
Start with one pairwise alignment, say Sc and S1, and aggregate the remaining pairs one at a time.
When adding a pair (Si, Sc), progressively add gaps to Sc to suit the new alignment, never removing gaps.
Star Alignment (cont.)
How should we select the center sequence?
Build a table with the pair-wise similarity score for each pair of sequences.
Choose the sequence with the highest sum of scores.
Star Alignment (cont.)
S1 = ATTGCCATT
S2 = ATGGCCATT
S3 = ATCCAATTTT
S4 = ATCTTCTT
S5 = ACTGACC
Pair-wise similarity scores:
      S1   S2   S3   S4   S5
S1     -    7   -2    0   -3
S2     7    -   -2    0   -4
S3    -2   -2    -    0   -7
S4     0    0    0    -   -3
S5    -3   -4   -7   -3    -
Pair-wise DP matrix for S1 (columns) and S2 (rows):
       -    A    T    T    G    C    C    A    T    T
  -    0   -2   -4   -6   -8  -10  -12  -14  -16  -18
  A   -2    1   -1   -3   -5   -7   -9  -11  -13  -15
  T   -4   -1    2    0   -2   -4   -6   -8  -10  -12
  G   -6   -3    0    1    1   -1   -3   -5   -7   -9
  G   -8   -5   -2   -1    2    0   -2   -4   -6   -8
  C  -10   -7   -4   -3    0    3    1   -1   -3   -5
  C  -12   -9   -6   -5   -2    1    4    2    0   -2
  A  -14  -11   -8   -7   -4   -1    2    5    3    1
  T  -16  -13  -10   -9   -6   -3    0    3    6    4
  T  -18  -15  -12  -11   -8   -5   -2    1    4    7
Score = 7
S1 = ATTGCCATT
S2 = ATGGCCATT
So S1 is picked as the center.
For k sequences, each of size n:
Time = T1 = O((k(k-1)/2) n^2) = O(k^2 n^2)
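The center-selection step can be sketched as follows. The scoring values (match +1, mismatch -1, gap -2) are not stated on the slide; they are inferred here from the DP matrix for S1 and S2 and should be treated as an assumption. The names `nw_score`, `pick_center` and `SLIDE_SCORES` are illustrative, and the score table is copied from the slide rather than recomputed.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score, keeping one row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def pick_center(scores):
    """Return (index of the row with the highest off-diagonal sum, all sums)."""
    sums = [
        sum(v for j, v in enumerate(row) if j != i)
        for i, row in enumerate(scores)
    ]
    center = max(range(len(sums)), key=sums.__getitem__)
    return center, sums

# Pair-wise similarity scores as given on the slide (diagonal unused).
SLIDE_SCORES = [
    [0, 7, -2, 0, -3],
    [7, 0, -2, 0, -4],
    [-2, -2, 0, 0, -7],
    [0, 0, 0, 0, -3],
    [-3, -4, -7, -3, 0],
]

print(nw_score("ATTGCCATT", "ATGGCCATT"))  # S1 vs. S2
print(pick_center(SLIDE_SCORES))           # index 0 corresponds to S1
```

Filling the table takes k(k-1)/2 calls to `nw_score`, each O(n^2), which is the T1 term above.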
Star Alignment (cont.)
[Figure: star with S1 at the center and S2, S3, S4, S5 around it]
Pair-wise alignments with the center S1:
S1 = ATTGCCATT
S2 = ATGGCCATT

S1 = ATTGCCATT--
S3 = ATC-CAATTTT

S1 = ATTGCCATT
S4 = ATCTTC-TT

S1 = ATTGCCATT
S5 = ACTGACC--

“Once a gap, always a gap”
Before padding:
S1 = ATTGCCATT
S2 = ATGGCCATT
S3 = ATC-CAATTTT
S4 = ATCTTC-TT
S5 = ACTGACC--
After padding:
S1 = ATTGCCATT--
S2 = ATGGCCATT--
S3 = ATC-CAATTTT
S4 = ATCTTC-TT--
S5 = ACTGACC----
For k sequences, each of size n, and an upper bound a on the alignment length:
Time = T2 = O((k-1) n^2 + (k-1)^2 a)
T1 + T2 = O(k^2 n^2 + k n^2 + (k-1)^2 a)
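The aggregation step can be sketched as a merge of pair-wise alignments that all share the same center sequence. This is one possible implementation of the "once a gap, always a gap" rule, not necessarily the one used in the course; the function name `merge_star` is illustrative, and the input alignments are the ones shown on the slide.

```python
GAP = "-"

def merge_star(pairwise):
    """Merge (center_aligned, other_aligned) pairs into one alignment.

    Row 0 of the result is the center; gaps already placed in the
    center are never removed, only added.
    """
    first_center, first_other = pairwise[0]
    rows = [list(first_center), list(first_other)]
    for c_aln, s_aln in pairwise[1:]:
        master = rows[0]
        new_rows = [[] for _ in rows]
        new_seq = []
        i = j = 0
        while i < len(master) or j < len(c_aln):
            in_master = i < len(master)
            in_pair = j < len(c_aln)
            if in_master and in_pair and master[i] == c_aln[j]:
                # Same center column in both: copy it and the new residue.
                for old, new in zip(rows, new_rows):
                    new.append(old[i])
                new_seq.append(s_aln[j])
                i += 1
                j += 1
            elif in_pair and c_aln[j] == GAP:
                # New pair needs a center gap: add a gap column everywhere.
                for new in new_rows:
                    new.append(GAP)
                new_seq.append(s_aln[j])
                j += 1
            else:
                # Existing center gap the new pair lacks: pad the new row.
                for old, new in zip(rows, new_rows):
                    new.append(old[i])
                new_seq.append(GAP)
                i += 1
        rows = new_rows + [new_seq]
    return ["".join(r) for r in rows]

pairs = [
    ("ATTGCCATT", "ATGGCCATT"),      # S1, S2
    ("ATTGCCATT--", "ATC-CAATTTT"),  # S1, S3
    ("ATTGCCATT", "ATCTTC-TT"),      # S1, S4
    ("ATTGCCATT", "ACTGACC--"),      # S1, S5
]
for row in merge_star(pairs):
    print(row)
```

Each merge walks the current alignment once for every row already present, which is where the (k-1)^2 a term in T2 comes from.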
Issues in Star Alignment
How to select the best anchor?
How to determine the order of progression?
Tree Alignment
Uses a clustering technique to order groups of related sequences in a hierarchical tree;
Based on the tree hierarchy (order from leaves to root), the multiple sequence alignment is generated by aligning and combining groups of sequences;
The Basic Idea of Tree Alignment
[Figure: (a) a set of sequences S1-S9; (c) a pair-wise distance matrix over S1-S9]
Question: In general, given pair-wise distances between a set S of objects (e.g. a distance matrix), how do we derive a weighted tree T where each leaf of T corresponds to an object in S, and the distance between two leaves i, j in T corresponds to the distance between i and j in S?
Answer: This is an important problem in computational biology, and it has been studied by many authors using a variety of techniques.
CLUSTAL W: A Tool for Progressive Multiple Sequence Alignment with Improved Sensitivity
CLUSTAL W (Cont.)
All pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences;
A guide tree is calculated from the distance matrix;
The sequences are progressively aligned according to the branching order in the guide tree.
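The first step, turning a pair-wise alignment into a divergence value, can be illustrated with a percent-identity distance. This is a simplified sketch: it assumes the distance is one minus the fraction of identical residues over the positions where neither sequence has a gap, and it leaves out the other options a real tool offers. The function name `identity_distance` is illustrative.

```python
def identity_distance(row_a, row_b):
    """1 - (identical positions / compared positions), ignoring gap columns."""
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue               # gap positions are not compared
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 1.0                 # nothing comparable: treat as maximally distant
    return 1.0 - identical / compared

# Pair-wise alignments taken from the star-alignment example.
print(identity_distance("ATTGCCATT", "ATGGCCATT"))
print(identity_distance("ATTGCCATT--", "ATC-CAATTTT"))
```

Applying this to every pair of sequences fills the distance matrix from which the guide tree is built.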
CLUSTAL W (Cont.)
Sequences: S1, S2, S3, S4
Distance matrix:
      S1   S2   S3   S4
S1     -  D12  D13  D14
S2          -  D23  D24
S3               -  D34
S4                    -
Guide tree: ((S1, S3), (S2, S4))
CLUSTAL W (Cont.)
Guide tree: ((S1, S3), (S2, S4))
Align most similar pair, inserting gaps to optimize the alignment.
Align next most similar pair.
Align alignments, preserve gaps; add new gaps to optimize the alignment of (S1 S3) with (S2 S4).
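The order of progressive alignment follows from a leaves-to-root walk of the guide tree. The sketch below represents the four-sequence guide tree as nested tuples (an assumption made for illustration) and lists the alignment steps in the order they would be performed; it does not compute the alignments themselves. The name `alignment_order` is illustrative.

```python
def alignment_order(tree):
    """List (left, right) merge steps in post-order, leaves first."""
    steps = []

    def visit(node):
        if isinstance(node, str):       # a leaf is a single sequence name
            return node
        left, right = (visit(child) for child in node)
        steps.append((left, right))     # align the two already-built groups
        return (left, right)

    visit(tree)
    return steps

guide_tree = (("S1", "S3"), ("S2", "S4"))
for left, right in alignment_order(guide_tree):
    print("align", left, "with", right)
```

The last step aligns two alignments rather than two sequences, which is where existing gaps must be preserved.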
CLUSTAL W: Some Implementation Hints
Distance Matrix
Initially all sequences are aligned pairwise.
[Figure: pairwise comparisons among S1, S2, S3, ..., Sn, e.g. S1 vs. S2 = 7 and S1 vs. S3 = 8]
Distance matrix:
      S1   S2   S3   S4   S5   S6   S7
S2     7
S3     8    5
S4    11    8    5
S5    13   10    7    8
S6    16   13   10   11    5
S7    13   10    7    8    6    9
S8    17   14   11   12   10   13    8
Two Options for Pairwise Alignment
Fast approximate method (Bashford, D., Chothia, C., 1987, J. Mol. Biol.)
  Allows a large number of sequences to be aligned even on a microcomputer
Full dynamic programming alignments (Myers, E., Miller, W., 1988, CABIOS)
  Two gap penalties
  Full weight matrix
The Guide Tree
Unrooted tree
  Calculated from the distance matrix (Neighbour-Joining Method)
Rooted tree
  Calculated from the unrooted tree (Middle Point Method)
Unrooted Tree
The Neighbor Joining Method provides not only the topology but also the branch lengths (Fitch, Margoliash) of the final tree.
Each node represents a sequence.
Each path length represents the distance between two specific sequences.
Unrooted Tree - Example
[Figure: unrooted tree with leaves S1-S8, internal nodes A-F, and branch lengths L1-L8]
Neighbour Joining Method
S12 = sum of all branch lengths = f(D's)
[Figure: star-like tree over S1-S8 and the tree after one joining step, with new internal nodes X and Y]
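One joining step can be sketched with the widely used Q-matrix form of the neighbour-joining criterion, which selects the same pair as minimizing the total branch length. The formulas below are the standard ones, not taken from the slides, and the distance values are this sketch's reading of the garbled eight-sequence matrix on the Distance Matrix slide, so both should be treated as assumptions. The names `full_matrix` and `nj_pick_pair` are illustrative.

```python
def full_matrix(lower):
    """Expand a lower-triangular list of rows into a symmetric matrix."""
    n = len(lower) + 1
    d = [[0] * n for _ in range(n)]
    for i, row in enumerate(lower, start=1):
        for j, value in enumerate(row):
            d[i][j] = d[j][i] = value
    return d

def nj_pick_pair(d):
    """Return (i, j, length_i, length_j) for the first pair to join.

    Q(i, j) = (n - 2) * d[i][j] - r[i] - r[j], where r is the row sum;
    the pair with the smallest Q is joined. Requires n > 2.
    """
    n = len(d)
    r = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from the joined leaves to their new internal node.
    length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return i, j, length_i, length_j

# Assumed reading of the slide's matrix: rows S2..S8, columns S1..S7.
LOWER = [
    [7],
    [8, 5],
    [11, 8, 5],
    [13, 10, 7, 8],
    [16, 13, 10, 11, 5],
    [13, 10, 7, 8, 6, 9],
    [17, 14, 11, 12, 10, 13, 8],
]

print(nj_pick_pair(full_matrix(LOWER)))  # indices 0 and 1 are S1 and S2
```

A full run would replace the joined pair by the new node, recompute its distances to the remaining sequences, and repeat until the tree is resolved.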
NJ-Method Example
PSI-BLAST
Observation
Database searches using position-specific score matrices, also called profiles or motifs, are often much better able to detect weak relationships than database searches that use a simple sequence as the query.
PSI-BLAST (cont.)
PSI-BLAST uses a procedure to construct a position-specific score matrix automatically from the output of a BLAST run, and a modified BLAST that operates on such a matrix in place of a simple query.
The resulting PSI-BLAST program is often substantially more sensitive than the corresponding BLAST program.
PSI-BLAST and Multiple Sequence Alignment
PSI-BLAST also produces a multiple sequence alignment with the query sequence as a master template:
Collect all hits with an E-value below a threshold, say 0.01.
Do not include copies of sequences identical to the query.
Retain one copy for each hit which is very similar to the query.
Other details.
The MSA constructed is used by PSI-BLAST to construct a scoring matrix.
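The idea of deriving a scoring matrix from an alignment can be illustrated with a plain log-odds profile. This is a deliberately simplified sketch and not PSI-BLAST's actual procedure: it assumes a uniform background distribution, a flat pseudocount, and no sequence weighting. The names `build_pssm` and `score_with_pssm`, and the toy alignment, are illustrative.

```python
import math

def build_pssm(msa, alphabet, pseudocount=1.0):
    """One dict of log2-odds scores per alignment column.

    Gaps are skipped, so each column is estimated only from the
    sequences that actually cover it.
    """
    background = 1.0 / len(alphabet)   # assumption: uniform background
    pssm = []
    for col in zip(*msa):
        residues = [c for c in col if c != "-"]
        total = len(residues) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2((residues.count(a) + pseudocount) / total / background)
            for a in alphabet
        })
    return pssm

def score_with_pssm(pssm, seq):
    """Score an ungapped sequence of the same length as the profile."""
    return sum(col[ch] for col, ch in zip(pssm, seq))

toy_msa = ["ACG", "ACG", "A-G"]        # hypothetical three-row alignment
profile = build_pssm(toy_msa, "ACGT")
print(profile[0]["A"], profile[0]["C"])
print(score_with_pssm(profile, "ACG"), score_with_pssm(profile, "TTT"))
```

Residues that are conserved in a column receive positive scores and unseen residues negative ones, so a sequence matching the conserved pattern scores higher than an unrelated one.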
Where Does PSI-BLAST Differ from Other “True” MSA Methods?
PSI-BLAST deals with local alignments, so each column of M (the multiple alignment) may involve a varying number of sequences. In fact, some columns may include only the query sequence itself.
Classification of Multiple Sequence Alignment Methods
[Figure: classification of MSA methods along two axes, global vs. local alignment and progressive vs. iterative strategy; methods shown: STAR, Tree, MULTAL, MULTALIGN, PILEUP, CLUSTAL W, MSA, DALIGN, HMM (HMMT), Genetic Algorithm (SAGA), PSI-BLAST, PIMA]
How to Compare Alignment Software?
CASA: A Server for the Critical Assessment of Protein Sequence Alignment Accuracy
Remote users:
• Download fasta sequences
• Produce a set of sequence alignments
• Submit the resulting alignments
• Benchmarking program evaluates parameters
[Figure: Users 1-4 connect over a high-speed network to the benchmark server, which hosts a fasta proteins database, a CE alignments database, and a web-interactive benchmarking program; the figure contrasts a structural alignment with a sequence alignment, using the strings ASVIE-AAVIVIVI-EPAAG and A-SVIE-AAV-VIVI-EPAAG]
Methods of Discovery of Biological Sequence Homology
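[Figure: taxonomy of methods for discovering sequence homology]
Alignment
  Pair-wise
    Optimal (DP): global, local
    Heuristic: FAST, BLAST
  MSA
    Heuristic: progressive, iterative (see Classification of Multiple Sequence Alignment Methods)
Pattern Matching
  Labels as shown in the figure: “EventuationAll and verify”, “Scan Seeds And???”, Combined, Discover
  Tools named: FLASH, TEIRESIAS, MOTIF/ASSET, PRATT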