multiple sequence comparison. /course/eleg667-01-f/topic-2c2 outline motivation multiple sequence...

Multiple Sequence Comparison

Post on 19-Dec-2015

Page 1: Multiple Sequence Comparison. /course/eleg667-01-f/Topic-2c2 Outline  Motivation  Multiple Sequence Alignment using Dynamic programming  Multiple Sequence

Multiple Sequence Comparison

/course/eleg667-01-f/Topic-2c 2

Outline

• Motivation
• Multiple Sequence Alignment using Dynamic programming
• Multiple Sequence Alignment using Heuristics
   • Star Alignments
   • Tree Alignments (CLUSTAL W)
• PSI-BLAST and multiple sequence alignment
• Evaluation of Alignment Methods
• Summary

/course/eleg667-01-f/Topic-2c 3

Pair-wise sequence comparison

“In biomolecular sequences (DNA, RNA, Protein), high sequence similarity usually implies significant functional or structural similarity.”

Underlies the effectiveness of pair-wise sequence comparison and of biological database searching

Find sequences that have common sub-patterns but may not have been known to be biologically related.

/course/eleg667-01-f/Topic-2c 4

Multiple sequence comparison

“Evolutionarily and functionally related molecular sequences can differ significantly at the sequence level and yet preserve similar function and/or structure.”

Underlies the effectiveness of multiple sequence comparison.

Deduce unknown conserved patterns from a set of sequences already known to be biologically related.

/course/eleg667-01-f/Topic-2c 5

Common MSA Applications

Characterization and representation of protein families and later identification of other potential members of the family;

Identification and representation of conserved sequence features that correlate with structure and function;

Deduction of evolutionary history.

/course/eleg667-01-f/Topic-2c 6

Common MSA Applications

To detect/demonstrate homology between new sequences and existing families of sequences

To help predict the secondary and tertiary structures of new sequences

To suggest oligonucleotide primers for PCR

/course/eleg667-01-f/Topic-2c 7

Comparing Multiple Sequences

We can compare multiple sequences by aligning the sequences and assigning a score to the alignments

Multiple Sequence Alignment (MSA)

/course/eleg667-01-f/Topic-2c 8

Application of MSA

[Figure: a homology search (e.g. BLAST) against a database returns the top scoring hits; an MSA of those hits reveals conserved regions, evolution paths, and so on.]

/course/eleg667-01-f/Topic-2c 9

Definition of MSA

A Multiple Sequence Alignment is obtained by inserting into each sequence a (possibly zero) number of gaps so that the resulting sequences are of the same length and each column has at least one character different from ‘-’ (gap).

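Input sequences:

IMAGINABLE
IMPRACTICABLE
INFALLIBLE

An alignment that satisfies the definition:

IM--AG-INABLE
IMPRACTICABLE
IN-FALLI--BLE

Not a valid alignment (the third column contains only gaps):

IM---AG-INABLE
IM-PRACTICABLE
IN--FALLI--BLE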

/course/eleg667-01-f/Topic-2c 10

How to score an alignment?

The Sum-of-Pairs (SP) score: A multiple alignment implies a pair-wise alignment for each pair of sequences; SP defines the score of a multiple alignment as the sum of the scores of all implied pair-wise alignments.

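S1: A A C G T A C G A T A
S2: A - C G T A - A A T G
S3: G T C G T A - - T T A

match = 1, mismatch = 0, gap-character = -1, gap-gap = 0

Pair-wise scores: 5 (S1, S2), 4 (S1, S3), 3 (S2, S3)

SP score = 5 + 4 + 3 = 12

Column by column: 1, -2, 3, 3, 3, 3, -2, -2, 1, 3, 1; sum = 12

Note: score (-,-) = 0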
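The SP score of the three-row example can be checked with a short script. This is only a sketch: the function names `pair_score` and `sp_score` are illustrative, and the scoring values are the ones listed on the slide (match = 1, mismatch = 0, gap-character = -1, gap-gap = 0).

```python
from itertools import combinations

def pair_score(a, b):
    """Score one pair of aligned symbols with the slide's scheme."""
    if a == "-" and b == "-":
        return 0          # gap against gap
    if a == "-" or b == "-":
        return -1         # gap against character
    return 1 if a == b else 0

def sp_score(alignment):
    """Sum-of-Pairs score: sum pair_score over every pair of rows in every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(pair_score(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))

alignment = ["AACGTACGATA",
             "A-CGTA-AATG",
             "GTCGTA--TTA"]

print(sp_score(alignment))                      # SP score of the whole alignment
print(sp_score([alignment[0], alignment[1]]))   # implied pair-wise alignment S1/S2
```

Summing per column or per implied pair-wise alignment gives the same total, because both are just two orders of adding up the same symbol pairs.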

/course/eleg667-01-f/Topic-2c 11

MSA using dynamic programming

[Figure: three-dimensional dynamic programming lattice for three sequences; each cell is reached from 7 neighbouring cells, one per possible alignment column.]

7 calculations/cell

If k sequences of size n then:

O(n^k) space and

O(k^2 2^k n^k) time

n^k cells, 2^k - 1 calculations/cell, k(k-1)/2 calculations to compute the SP-score
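The recurrence behind these counts can be sketched for any number of sequences. The code below is an illustration under stated assumptions, not the course's implementation: it uses the SP scoring values from the scoring slide (match = 1, mismatch = 0, gap-character = -1, gap-gap = 0), returns only the optimal score, and the names `predecessor_moves` and `msa_sp_score` are made up for this example.

```python
from functools import lru_cache
from itertools import combinations, product

def pair_score(a, b):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0

def column_score(column):
    # k(k-1)/2 pair comparisons per column
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def predecessor_moves(k):
    # Every non-empty choice of which sequences advance: 2^k - 1 moves
    return [m for m in product((0, 1), repeat=k) if any(m)]

def msa_sp_score(seqs):
    """Optimal SP score over all multiple alignments of `seqs` (exponential in len(seqs))."""
    k = len(seqs)
    moves = predecessor_moves(k)

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0                      # origin cell: nothing aligned yet
        candidates = []
        for move in moves:
            if all(p >= d for p, d in zip(pos, move)):
                prev = tuple(p - d for p, d in zip(pos, move))
                column = [seqs[i][pos[i] - 1] if move[i] else "-" for i in range(k)]
                candidates.append(best(prev) + column_score(column))
        return max(candidates)

    return best(tuple(len(s) for s in seqs))

print(len(predecessor_moves(3)))          # 7 calculations per cell for three sequences
print(msa_sp_score(["AG", "AG", "AG"]))
```

The memo table holds one entry per lattice cell, which is where the O(n^k) space comes from; the loop over `moves` and the pair loop in `column_score` account for the remaining factors in the time bound.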

/course/eleg667-01-f/Topic-2c 12

Recall the pair-wise case

Question: when the DP comparison ends, how many possible distinct paths have been explored in total for this example?

Answer: Let us count. Total = 13

          G    A
     0   -1   -2
A   -1    0    0
G   -2    0    0

Question: from 1 to 9 how many paths?

[Figure: the nine cells of the DP matrix are numbered 1 to 9; a tree enumerating every path from cell 1 to cell 9 ends in 13 leaves.]
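The count of 13 can be reproduced by counting lattice paths directly. The sketch below assumes that a path moves from a cell to a neighbour by advancing in at least one sequence (right, down, or diagonal in the pair-wise case); `count_paths` is an illustrative name, and the same function accepts three or more dimensions.

```python
from functools import lru_cache
from itertools import product

def count_paths(shape):
    """Number of DP paths from the origin cell to the cell at offset `shape`.

    `shape` holds the sequence lengths, e.g. (2, 2) for the 3 x 3 matrix
    obtained by comparing two sequences of length 2.
    """
    k = len(shape)
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def paths(pos):
        if not any(pos):
            return 1                      # exactly one (empty) path to the origin
        return sum(paths(tuple(p - d for p, d in zip(pos, m)))
                   for m in moves
                   if all(p >= d for p, d in zip(pos, m)))

    return paths(tuple(shape))

print(count_paths((2, 2)))      # pair-wise example: paths from cell 1 to cell 9
print(count_paths((1, 1, 1)))   # smallest three-sequence lattice
```

Calling `count_paths` with three lengths answers the same question for three sequences; the count grows quickly with each extra dimension.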

/course/eleg667-01-f/Topic-2c 13

Align Multiple Sequences

Assume we have 3 sequences: AGACGC

How to do DP?

Question: When the DP comparison ends, how many possible distinct paths have been explored in total?

Answer: Count!

[Figure: DP lattice with the three sequences along its axes.]

/course/eleg667-01-f/Topic-2c 14

MSA Using DP with Heuristics

How to cut down the search space (# of calculations) at each step?

One way is to eliminate pairwise projections that do not contribute to the optimal alignment, which requires developing a test for such projections.

/course/eleg667-01-f/Topic-2c 15

Other MSA Methods Using Heuristics

Star Alignment: Build a multiple alignment based upon the pair-wise alignments between a fixed sequence, called the “center” of the input set, and all others.

Tree Alignment: Build a multiple alignment based upon the pair-wise alignments along edges of a tree relating all the sequences.

/course/eleg667-01-f/Topic-2c 16

Star Alignment

• Given k sequences
• Pick one of the sequences as the center
• Find optimal pair-wise alignments between the center sequence and each other sequence.
• Aggregate the pair-wise alignments (progressive alignment)

/course/eleg667-01-f/Topic-2c 17

Aggregate Step

• Use the center Sc as a guide
• Start with one pairwise alignment, say Sc and S1, and aggregate the remaining pairs one at a time
• When adding a pair (Si, Sc), progressively increase the gaps in Sc to suit the further alignment, never removing gaps.

/course/eleg667-01-f/Topic-2c 18

Star Alignment (cont.)

How should we select the center sequence?

Build a table with the pair-wise similarity score for each pair of sequences.

Choose the sequence with the highest sum of scores.

/course/eleg667-01-f/Topic-2c 19

Star Alignment (cont.)

S1 = ATTGCCATT
S2 = ATGGCCATT
S3 = ATCCAATTTT
S4 = ATCTTCTT
S5 = ACTGACC

Pair-wise similarity scores:

       S1   S2   S3   S4   S5
S1      -    7   -2    0   -3
S2      7    -   -2    0   -4
S3     -2   -2    -    0   -7
S4      0    0    0    -   -3
S5     -3   -4   -7   -3    -

DP matrix for S1 (columns) against S2 (rows):

            A    T    T    G    C    C    A    T    T
       0   -2   -4   -6   -8  -10  -12  -14  -16  -18
A     -2    1   -1   -3   -5   -7   -9  -11  -13  -15
T     -4   -1    2    0   -2   -4   -6   -8  -10  -12
G     -6   -3    0    1    1   -1   -3   -5   -7   -9
G     -8   -5   -2   -1    2    0   -2   -4   -6   -8
C    -10   -7   -4   -3    0    3    1   -1   -3   -5
C    -12   -9   -6   -5   -2    1    4    2    0   -2
A    -14  -11   -8   -7   -4   -1    2    5    3    1
T    -16  -13  -10   -9   -6   -3    0    3    6    4
T    -18  -15  -12  -11   -8   -5   -2    1    4    7

Score = 7

S1 = ATTGCCATT
S2 = ATGGCCATT

So S1 is picked as the center

For k sequences, each of size n:

Time = T1 = O((k(k-1)/2) n^2) = O(k^2 n^2)
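The center selection can be sketched in a few lines. Two assumptions are made here: the scoring values match = +1, mismatch = -1, gap = -2 are inferred from the S1/S2 DP matrix on the slide rather than stated on it, and the function names `nw_score` and `pick_center` are illustrative. The score table is the one shown on the slide.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def pick_center(scores):
    """Return the sequence with the highest sum of pair-wise scores, plus all sums."""
    totals = {name: sum(row.values()) for name, row in scores.items()}
    return max(totals, key=totals.get), totals

# Pair-wise scores as read from the slide's table
slide_scores = {
    "S1": {"S2": 7, "S3": -2, "S4": 0, "S5": -3},
    "S2": {"S1": 7, "S3": -2, "S4": 0, "S5": -4},
    "S3": {"S1": -2, "S2": -2, "S4": 0, "S5": -7},
    "S4": {"S1": 0, "S2": 0, "S3": 0, "S5": -3},
    "S5": {"S1": -3, "S2": -4, "S3": -7, "S4": -3},
}

center, totals = pick_center(slide_scores)
print(center, totals)
print(nw_score("ATTGCCATT", "ATGGCCATT"))   # the S1/S2 entry of the table
```

Filling the whole table takes k(k-1)/2 calls to `nw_score`, each O(n^2), which is the T1 term above.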

/course/eleg667-01-f/Topic-2c 20

Star Alignment (cont.)

[Figure: star with the center S1 linked to S2, S3, S4 and S5.]

Pair-wise alignments with the center:

S1 = ATTGCCATT
S2 = ATGGCCATT

S1 = ATTGCCATT--
S3 = ATC-CAATTTT

S1 = ATTGCCATT
S4 = ATCTTC-TT

S1 = ATTGCCATT
S5 = ACTGACC--

“Once a gap, always a gap”

Stacked pair-wise rows:

S1 = ATTGCCATT
S2 = ATGGCCATT
S3 = ATC-CAATTTT
S4 = ATCTTC-TT
S5 = ACTGACC--

Final multiple alignment:

S1 = ATTGCCATT--
S2 = ATGGCCATT--
S3 = ATC-CAATTTT
S4 = ATCTTC-TT--
S5 = ACTGACC----
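The aggregation with "once a gap, always a gap" can be sketched as a merge over the center row. This is one possible way to implement the step, not necessarily the one used in the course; `merge_star` is an illustrative name, and the input is the list of pair-wise alignments from the slide, each given as (aligned center, aligned other sequence).

```python
def merge_star(pairwise):
    """Merge pair-wise alignments that all share the same (ungapped) center sequence.

    Returns the rows of the multiple alignment, center first.
    """
    first_center, first_other = pairwise[0]
    rows = [list(first_center), list(first_other)]
    for center_aln, other_aln in pairwise[1:]:
        old_center = rows[0]
        new_rows = [[] for _ in rows]
        new_seq = []
        i = j = 0
        while i < len(old_center) or j < len(center_aln):
            old_gap = i < len(old_center) and old_center[i] == "-"
            new_gap = j < len(center_aln) and center_aln[j] == "-"
            if old_gap and not new_gap:
                # Gap already in the merged center: keep the column, pad the new row
                for out, row in zip(new_rows, rows):
                    out.append(row[i])
                new_seq.append("-")
                i += 1
            elif new_gap and not old_gap:
                # New gap in the center: open a gap column in every existing row
                for out in new_rows:
                    out.append("-")
                new_seq.append(other_aln[j])
                j += 1
            else:
                # Same center symbol (or a shared gap) on both sides
                for out, row in zip(new_rows, rows):
                    out.append(row[i])
                new_seq.append(other_aln[j])
                i += 1
                j += 1
        rows = new_rows + [new_seq]
    return ["".join(row) for row in rows]

pairs = [("ATTGCCATT", "ATGGCCATT"),       # S1 with S2
         ("ATTGCCATT--", "ATC-CAATTTT"),   # S1 with S3
         ("ATTGCCATT", "ATCTTC-TT"),       # S1 with S4
         ("ATTGCCATT", "ACTGACC--")]       # S1 with S5

for row in merge_star(pairs):
    print(row)
```

Gaps are only ever added to the rows already merged, never removed, so every pair-wise alignment with the center is preserved inside the result.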

For k sequences, each of size n, and an upper bound a on the alignment length:

Time = T2 = O((k-1) n^2 + (k-1)^2 a)

T1 + T2 = O(k^2 n^2 + k n^2 + (k-1)^2 a)

/course/eleg667-01-f/Topic-2c 21

Issues in Star Alignment

• How to select the best anchor?
• How to determine the order of progression?

/course/eleg667-01-f/Topic-2c 22

Tree Alignment

Uses a clustering technique to order groups of related sequences in a hierarchical tree;

Based on the tree hierarchy (order from leaves to root), the multiple sequence alignment is generated by aligning and combining groups of sequences;

/course/eleg667-01-f/Topic-2c 23

The Basic Idea of Tree Alignment

[Figure: (a) a set of sequences S1 to S9; (c) a pair-wise distance matrix with rows and columns S1 to S9.]

/course/eleg667-01-f/Topic-2c 24

Question: In general, given pair-wise distances between a set S of objects (e.g. a distance matrix), how do we derive a weighted tree T where each leaf of T corresponds to an object in S, and the distance between two leaves i, j corresponds to the distance between i and j in S?

Answer: This is an important problem in computational biology, and has been studied by many authors using a variety of techniques.

Clustal W: A Tool of Progressive Multiple Sequence Alignment with Improved Sensitivity

/course/eleg667-01-f/Topic-2c 26

CLUSTAL W (Cont.)

All pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences;

A guide tree is calculated from the distance matrix;

The sequences are progressively aligned according to the branching order in the guide tree.
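The first step, turning a pair-wise alignment into a divergence value, can be sketched as follows. The definition used here (one minus the fraction of identical positions among the columns where neither sequence has a gap) is one common choice and is an assumption of this example; the options of an actual CLUSTAL W run may differ. The name `alignment_distance` is illustrative.

```python
def alignment_distance(row_a, row_b):
    """Divergence of two aligned rows: 1 - identities / compared positions.

    Columns in which either row has a gap are not compared.
    """
    compared = identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 1.0 - identical / compared

# Pair-wise alignments taken from the star alignment slides
print(alignment_distance("ATTGCCATT", "ATGGCCATT"))
print(alignment_distance("ATTGCCATT--", "ATC-CAATTTT"))
```

Applying this to every pair of sequences fills the distance matrix from which the guide tree is calculated.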

/course/eleg667-01-f/Topic-2c 27

CLUSTAL W (Cont.)

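[Figure: the sequences S1 to S4 are compared pair-wise to fill the distance matrix, from which the guide tree is built.]

Distance matrix

       S1    S2    S3    S4
S1           D12   D13   D14
S2                 D23   D24
S3                       D34
S4

Guide Tree: ((S1, S3), (S2, S4))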

/course/eleg667-01-f/Topic-2c 28

CLUSTAL W (Cont.)

Guide Tree: ((S1, S3), (S2, S4))

• Align most similar pair (gaps inserted to optimize alignment)
• Align next most similar pair
• Align alignments, preserve gaps (new gap to optimize alignment of (S1 S3) with (S2 S4))

/course/eleg667-01-f/Topic-2c 29

Clustal W: Some Implementation Hints

/course/eleg667-01-f/Topic-2c 30

Distance Matrix

Initially all sequences are pairwise aligned.

[Figure: every pair among S1, S2, S3, ..., Sn is aligned; for example, S1 with S2 gives 7 and S1 with S3 gives 8.]

       S1   S2   S3   S4   S5   S6   S7
S2      7
S3      8    5
S4     11    8    5
S5     13   10    7    8
S6     16   13   10   11    5
S7     13   10    7    8    6    9
S8     17   14   11   12   10   13    8

/course/eleg667-01-f/Topic-2c 31

Two Options for Pairwise Alignment

• Fast approximate method (Bashford, D., Chothia, C., 1987, J. Mol. Biol.)
   • Allows large number of seqs to be aligned even on a microcomputer
• Fully dynamic programming alignments (Myers, E., Miller, W., 1988, CABIOS)
   • Two gap penalties
   • Full weight matrix

/course/eleg667-01-f/Topic-2c 32

The Guide Tree

• Unrooted tree
   • Calculated from distance matrix (Neighbour-Joining Method)
• Rooted tree
   • Calculated from unrooted tree (Middle Point Method)

/course/eleg667-01-f/Topic-2c 33

Unrooted Tree

• Neighbor Joining Method provides not only the topology but also the branch lengths (Fitch, Margoliash) of the final tree
• Each node represents a sequence
• Each path length represents the distance between two specific sequences

/course/eleg667-01-f/Topic-2c 34

Unrooted Tree - Example

[Figure: example unrooted tree with leaves S1 to S8, internal nodes A to F, and branch lengths L1 to L8.]

/course/eleg667-01-f/Topic-2c 35

Neighbour Joining Method

[Figure: two trees over the sequences S1 to S8, with internal nodes labelled X and Y.]

S12 = Sum of all branch lengths = f(D’s)

/course/eleg667-01-f/Topic-2c 36

NJ-Method Example
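A small neighbour-joining run can be sketched as below. Two assumptions are made: the pair to join is chosen with the widely used Q-matrix criterion, Q(i, j) = (n - 2) d(i, j) - r_i - r_j, which stands in for the "sum of all branch lengths" criterion named on the previous slide, and the input is the eight-sequence distance matrix as read from the Distance Matrix slide. The function name `neighbor_joining` is illustrative.

```python
def neighbor_joining(names, dist):
    """Return the joins as (node_a, node_b, branch_a, branch_b); the last
    entry connects the two remaining nodes as (node_a, node_b, length)."""
    nodes = list(names)
    d = [row[:] for row in dist]
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        r = [sum(row) for row in d]
        # Pair with the smallest Q value (ties resolved by index order)
        _, i, j = min(((n - 2) * d[a][b] - r[a] - r[b], a, b)
                      for a in range(n) for b in range(a + 1, n))
        branch_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        branch_j = d[i][j] - branch_i
        joins.append((nodes[i], nodes[j], branch_i, branch_j))
        # Distances from the new internal node to every remaining node
        keep = [m for m in range(n) if m not in (i, j)]
        to_new = [(d[i][m] + d[j][m] - d[i][j]) / 2 for m in keep]
        d = [[d[a][b] for b in keep] + [to_new[x]] for x, a in enumerate(keep)]
        d.append(to_new + [0.0])
        nodes = [nodes[m] for m in keep] + ["(" + nodes[i] + "," + nodes[j] + ")"]
    joins.append((nodes[0], nodes[1], d[0][1]))
    return joins

# Lower triangle of the distance matrix (rows S2..S8, columns S1..S7)
lower = [[7], [8, 5], [11, 8, 5], [13, 10, 7, 8],
         [16, 13, 10, 11, 5], [13, 10, 7, 8, 6, 9],
         [17, 14, 11, 12, 10, 13, 8]]
names = ["S%d" % k for k in range(1, 9)]
dist = [[0.0] * 8 for _ in range(8)]
for i, row in enumerate(lower, start=1):
    for j, value in enumerate(row):
        dist[i][j] = dist[j][i] = float(value)

joins = neighbor_joining(names, dist)
for join in joins:
    print(join)
```

The order of the joins gives the branching order of the unrooted tree, and the branch values are the lengths attached to each joined node.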

/course/eleg667-01-f/Topic-2c 37

PSI-BLAST

Observation

Database searches using position-specific score matrices, also called profiles or motifs, are often much better able to detect weak relationships than database searches that use a simple sequence as the query.

/course/eleg667-01-f/Topic-2c 38

PSI-BLAST (cont.)

PSI-BLAST uses a procedure to construct a position-specific score matrix automatically from the output of a BLAST run, and modifies BLAST to operate using such a matrix in place of a simple query.

The resulting PSI-BLAST program often is substantially more sensitive than the corresponding BLAST program.

/course/eleg667-01-f/Topic-2c 39

PSI-BLAST and Multiple Sequence Alignment

PSI-BLAST also produces a multiple sequence alignment with the query sequence as a master template

• Collect all hits with E-value below a threshold, say 0.01, and
   • Do not include copies of sequences identical to the query
   • Retain one copy for each hit which is very similar to the query
   • Other details

The MSA constructed is used by PSI-BLAST for constructing a scoring matrix
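How an alignment becomes a position-specific score matrix can be sketched with plain column counts. This is a deliberately simplified illustration, not PSI-BLAST's actual procedure: it assumes a DNA alphabet, a uniform background, and a fixed pseudocount, and it omits sequence weighting. The names `build_pssm` and `score_window` are made up for the example.

```python
import math
from collections import Counter

def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """One dict of log-odds scores per alignment column; gaps are ignored."""
    background = 1.0 / len(alphabet)          # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({a: math.log2(((counts[a] + pseudocount) / total) / background)
                     for a in alphabet})
    return pssm

def score_window(pssm, segment):
    """Score an ungapped segment of the same length as the matrix."""
    if len(segment) != len(pssm):
        raise ValueError("segment length must equal the number of columns")
    return sum(column[symbol] for column, symbol in zip(pssm, segment))

pssm = build_pssm(["ATTG", "ATGG", "ATCG"])   # toy alignment, for illustration only
print(score_window(pssm, "ATTG"))
print(score_window(pssm, "CCCC"))
```

Conserved columns give large positive scores to the conserved symbol and negative scores to the rest, while variable columns score every symbol close to zero; this is what lets a profile weight positions differently from a single query sequence.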

/course/eleg667-01-f/Topic-2c 40

Where Does PSI-BLAST Differ from Other “True” MSA Methods?

PSI-BLAST deals with local alignments, so each column of M (the multiple alignment) may involve a varying number of sequences. In fact, some columns may include only the query sequence itself.

/course/eleg667-01-f/Topic-2c 41

Classification of Multiple Sequence Alignment Methods

[Figure: methods placed along a Global Alignment / Local Alignment axis. Labels in the figure: Progressive, STAR, Tree, MULTAL, MULTALIGN, PILEUP, CLUSTAL W, MSA, Iterative (local), DALIGN, HMM (HMMT), Genetic Algorithm (SAGA), PSI-BLAST, PIMA.]

How to Compare Alignment Software?

/course/eleg667-01-f/Topic-2c 43

CASA: A Server for the Critical Assessment of Protein Sequence Alignment Accuracy

Remote users

• Download fasta sequences
• Produce set of sequence alignments
• Submit the resulting alignments
• Benchmarking program evaluates parameters

[Figure: User 1 to User 4 connect over a high speed network to the benchmark server, which holds a Fasta proteins database, a CE alignments database, and the web interactive benchmarking program; a structural alignment is compared with a sequence alignment.]
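The slide does not say which parameters the benchmarking program evaluates. One plausible measure, assumed here purely for illustration, is the fraction of residue pairs in a reference (for example structural) alignment that a submitted sequence alignment reproduces. The names `aligned_pairs` and `pair_recovery` and the toy alignments are hypothetical.

```python
def aligned_pairs(row_a, row_b):
    """Set of (index in sequence A, index in sequence B) pairs placed in the same column."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs

def pair_recovery(candidate, reference):
    """Fraction of the reference's aligned residue pairs also present in the candidate."""
    ref_pairs = aligned_pairs(*reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(*candidate) & ref_pairs) / len(ref_pairs)

reference = ("ACGT", "A-GT")   # toy stand-in for a structural alignment
candidate = ("ACGT", "AG-T")   # toy stand-in for a submitted sequence alignment

print(pair_recovery(candidate, reference))
```

Comparing residue index pairs rather than strings makes the measure independent of where gap columns happen to be written.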

/course/eleg667-01-f/Topic-2c 44

Methods of Discovery of Biological Sequence Homology

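[Figure: taxonomy of methods. Alignment splits into Pair wise and MSA, with the labels Optimal, Global, Local, DP, Heuristic (FAST, BLAST) and Heuristic (Progressive, Iterative; see slide 41). Pattern Matching carries the labels “Eventuation All and verify”, “Scan Seeds And ???”, Combined, FLASH, Discover, TEIRESIAS, MOTIF/ASSET and PRATT. An example sequence ATGC is shown.]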