Protein Sequence Comparison

Patrice Koehl

http://koehllab.genomecenter.ucdavis.edu/teaching/ecs129/06/lecture-notes

Post on 22-Dec-2015


Why do we want to align protein sequences?

• If two sequences align well, the corresponding proteins are likely homologous; they probably share the same structure and/or function

• Sequence alignment is a tool for organizing the protein sequence space:

- detection of homologous proteins

- building evolutionary history


Alignment Methods

• Rigorous algorithms

- Needleman-Wunsch (global)

- Smith-Waterman (local)

• Rapid heuristics

- FASTA

- BLAST

What is sequence alignment?

• Given two sequences of letters and a scoring scheme for evaluating letter matching, find the optimal pairing of letters from one sequence to the other.

• Different alignments:

Favors identity:

ACCTAGGC
AC-T-GG

Favors similarity:

ACCTAGGC
ACT-GG

Gaps are marked with "-"


Aligning Biological Sequences

• Aligning DNA: 4 letter alphabet

ACGTTGGC

AC-T-GG

• Aligning protein: 20 letter alphabet

MCYTSWGC

MC-T-WG

Computing Cost

The computational complexity of aligning two sequences by exhaustive enumeration, when gaps are allowed anywhere, is exponential in the length of the sequences being aligned.

Computer science offers a solution for reducing the running time: Dynamic Programming
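To get a feeling for this growth, the number of possible gapped alignments can be counted directly. The sketch below assumes that every alignment column is one of three kinds (residue over residue, residue over gap, gap over residue) and that two alignments differing only in the order of adjacent gaps count as distinct; the function name `count_alignments` is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n: int, m: int) -> int:
    """Number of gapped alignments of two sequences of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # only one way left: pair every remaining residue with a gap
    return (count_alignments(n - 1, m - 1)   # last column: residue over residue
            + count_alignments(n - 1, m)     # last column: residue over gap
            + count_alignments(n, m - 1))    # last column: gap over residue


# The count explodes with sequence length, which rules out enumeration.
for length in (2, 5, 10, 20):
    print(length, count_alignments(length, length))
```

The cache makes the counting itself cheap, which already hints at the dynamic programming idea: the same sub-problems are reused many times.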


Dynamic Programming (DP) Concept

A problem with overlapping sub-problems and optimal sub-structures can be solved using the following algorithm:

(1) break the problem into smaller sub-problems

(2) solve these problems optimally using this 3-step procedure recursively

(3) use these optimal solutions to construct an optimal solution to the original problem

DP and Sequence Alignment

Key idea:

The score of the optimal alignment that ends at a given pair of positions in the sequences is the score of the best alignment previous to these positions plus the score of aligning these two positions.

DP and Sequence Alignment

Test all alignments that can lead to i aligned with j

[Figure: alignment matrix with the cell (i,j) marked "?"]

DP and Sequence Alignment

Find alignment with best [previous score + score(i,j)]

[Figure: alignment matrix showing the best alignment that ends at (i,j)]

Implementing the DP algorithm for sequences

Aligning 2 sequences S1 and S2 of lengths N and M:

1) Build an NxM alignment matrix A such that A(i,j) is the optimal score for alignments up to the pair (i,j)

2) Find the best score in A

3) Track back through the matrix to get the optimal alignment of S1 and S2.


Example 1

Sequence 1: ATGCTGC

Sequence 2: AGCC

Score(i,j) = 10 if the residues at positions i and j are identical, 0 otherwise

no gap penalty

Example 1

1) Initialize

      A   T   G   C   T   G   C
  A  10   0   0   0   0   0   0
  G   0
  C   0
  C   0

2) Propagate

      A   T   G   C   T   G   C
  A  10   0   0   0   0   0   0
  G   0  10
  C   0
  C   0

2) Propagate

      A   T   G   C   T   G   C
  A  10   0   0   0   0   0   0
  G   0  10  20
  C   0
  C   0

2) Propagate

      A   T   G   C   T   G   C
  A  10   0   0   0   0   0   0
  G   0  10  20  10  10  20  10
  C   0  10
  C   0

2) Propagate

      A   T   G   C   T   G   C
  A  10   0   0   0   0   0   0
  G   0  10  20  10  10  20  10
  C   0  10  10  30  20  20  30
  C   0  10  10  30  30

3) Trace Back

      A   T   G   C   T   G   C
  A  10   0   0   0   0   0   0
  G   0  10  20  10  10  20  10
  C   0  10  10  30  20  20  30
  C   0  10  10  30  30  30  40

Alignment:

ATGCTGC
AXGCXXC

Score: 40
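The matrix of Example 1 can be reproduced with a short sketch. It assumes the convention used in the tables above: A(i,j) is the best score of an alignment that ends with residue i of the second sequence paired with residue j of the first, obtained by adding Score(i,j) to the best score found in the previous row and column (minus a gap penalty when residues are skipped). The names `fill_matrix` and `identity`, and the treatment of leading gaps, are illustrative assumptions.

```python
def fill_matrix(cols, rows, score, gap):
    """A[i][j]: best score of an alignment ending with rows[i] paired to cols[j]."""
    A = [[0] * len(cols) for _ in rows]
    for i in range(len(rows)):
        for j in range(len(cols)):
            if i == 0 and j == 0:
                prev = 0
            elif i == 0 or j == 0:
                prev = -gap(max(i, j))  # assumption: a leading gap is penalized
            else:
                prev = max(
                    [A[i - 1][j - 1]]                                        # no gap
                    + [A[i - 1][j - 1 - k] - gap(k) for k in range(1, j)]    # skip k columns
                    + [A[i - 1 - k][j - 1] - gap(k) for k in range(1, i)]    # skip k rows
                )
            A[i][j] = score(rows[i], cols[j]) + prev
    return A


def identity(a, b):
    """Scoring scheme of Example 1: 10 for identical residues, 0 otherwise."""
    return 10 if a == b else 0


# Example 1: no gap penalty, so gap(k) is 0 for every gap size k.
A = fill_matrix("ATGCTGC", "AGCC", identity, gap=lambda k: 0)
for residue, row in zip("AGCC", A):
    print(residue, row)
print("best score:", max(max(row) for row in A))
```

Each cell scans a whole row and a whole column of earlier cells, which is why this direct version is slower than the number of cells alone would suggest.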

Mathematical Formulation

Global alignment (Needleman-Wunsch):

$$
A(i,j) = \mathrm{Score}(i,j) + \max
\begin{cases}
A(i-1,\,j-1), \\
\max\limits_{1 \le k \le j-2} \left[ A(i-1,\,j-1-k) - W_k \right], \\
\max\limits_{1 \le k \le i-2} \left[ A(i-1-k,\,j-1) - W_k \right]
\end{cases}
$$

Wk: penalty for a gap of size k

Complexity

1) The computing time required to fill in the alignment matrix is O(NM(N+M)), where N and M are the lengths of the 2 sequences

2) This can be reduced to O(NM) by storing the best score for each row and column.

True if gap penalty is linear!
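A minimal sketch of the O(NM) idea, assuming a linear gap penalty W_k = g0 + k*g1 and the same matrix convention as in Example 1. Instead of rescanning the previous row and column for every cell, it keeps running maxima of A + (position * g1), from which the best gapped predecessor follows in constant time. The names `fill_fast`, `g0` and `g1` are illustrative.

```python
NEG = float("-inf")


def identity(a, b):
    """Scoring scheme of Example 1: 10 for identical residues, 0 otherwise."""
    return 10 if a == b else 0


def fill_fast(cols, rows, score, g0, g1):
    """Fill the alignment matrix in O(NM) for the gap penalty W_k = g0 + k*g1."""
    n, m = len(rows), len(cols)
    A = [[0] * m for _ in range(n)]
    col_best = [NEG] * m  # per column j: max over rows i' <= i-2 of A[i'][j] + i'*g1
    for i in range(n):
        if i >= 2:  # row i-2 becomes reachable through a gap
            for j in range(m):
                col_best[j] = max(col_best[j], A[i - 2][j] + (i - 2) * g1)
        row_best = NEG    # max over columns j' <= j-2 of A[i-1][j'] + j'*g1
        for j in range(m):
            if i >= 1 and j >= 2:
                row_best = max(row_best, A[i - 1][j - 2] + (j - 2) * g1)
            if i == 0 and j == 0:
                prev = 0
            elif i == 0 or j == 0:
                prev = -(g0 + max(i, j) * g1)  # assumption: leading gap is penalized
            else:
                prev = max(A[i - 1][j - 1],
                           row_best - g0 - (j - 1) * g1,
                           col_best[j - 1] - g0 - (i - 1) * g1)
            A[i][j] = score(rows[i], cols[j]) + prev
    return A


# With g0 = g1 = 0 this reproduces the final matrix of Example 1.
for row in fill_fast("ATGCTGC", "AGCC", identity, 0, 0):
    print(row)
```

The trick relies on the penalty being linear in the gap size: the position-dependent part of W_k can then be folded into the stored maxima.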

Example 2

      A   A   T   G   C
  A  10  10   0   0   0
  G   0  10  10  20  10
  G   0  10  10  20  20
  C   0  10  10  10  30

High Score: 30

Alignments:

AATGC
AG-GC

AATGC
A-GGC

AATGC
-AGGC

AATG-C
A--GGC

AATG-C
-A-GGC

Example 2 with Gap

Gap cost: -2

      A   A   T   G   C
  A  10   8  -2  -2  -2
  G  -2  10   8  18   8
  G  -2   8  10  18  16
  C  -2   8   8  10  28

High Score: 28

Alignments:

AATGC
AG-GC

AATGC
A-GGC

AATGC
-AGGC

Complexity (2)

1) The traceback routine can be quite costly in computing time if all possible optimal paths are required, since there may be many branches.

2) Usually, an arbitrary choice is made about which branch to follow. Then computing time is O(max(N,M)) by simply following pointers.
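The pointer-following traceback can be sketched as follows. This is an illustration under stated assumptions, not the course implementation: each cell stores one predecessor (the first best candidate found, an arbitrary choice among ties), the traceback starts from the best cell of the last row or last column, and gaps are printed as "-". The name `align` is illustrative.

```python
def align(cols, rows, score, gap):
    """Fill the matrix, keep one pointer per cell, and trace back one optimal path."""
    n, m = len(rows), len(cols)
    A = [[0] * m for _ in range(n)]
    ptr = [[None] * m for _ in range(n)]  # predecessor cell, None at the start
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                best, arg = 0, None
            elif i == 0 or j == 0:
                best, arg = -gap(max(i, j)), None  # assumption: leading gap penalized
            else:
                cands = [(A[i - 1][j - 1], (i - 1, j - 1))]
                cands += [(A[i - 1][j - 1 - k] - gap(k), (i - 1, j - 1 - k)) for k in range(1, j)]
                cands += [(A[i - 1 - k][j - 1] - gap(k), (i - 1 - k, j - 1)) for k in range(1, i)]
                best, arg = max(cands, key=lambda c: c[0])  # ties: first candidate wins
            A[i][j] = score(rows[i], cols[j]) + best
            ptr[i][j] = arg

    # Assumption: the alignment must reach the end of at least one sequence.
    ends = [(n - 1, j) for j in range(m)] + [(i, m - 1) for i in range(n)]
    cell = max(ends, key=lambda c: A[c[0]][c[1]])
    best_score = A[cell[0]][cell[1]]

    path = []
    while cell is not None:  # follow pointers back to the first aligned pair
        path.append(cell)
        cell = ptr[cell[0]][cell[1]]
    path.reverse()

    top, bottom = "", ""
    pi, pj = -1, -1
    for i, j in path + [(n, m)]:  # (n, m) is a sentinel that flushes the tails
        top += cols[pj + 1:j] + "-" * (i - pi - 1)
        bottom += "-" * (j - pj - 1) + rows[pi + 1:i]
        if i < n:
            top += cols[j]
            bottom += rows[i]
        pi, pj = i, j
    return best_score, top, bottom


print(align("ATGCTGC", "AGCC", lambda a, b: 10 if a == b else 0, lambda k: 0))
```

For Example 1 this follows the same path as the trace back shown earlier, with "-" in place of "X".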

The Scoring Scheme

• Scores are usually stored in a “weight” matrix or “substitution” matrix or “matching” matrix.

• Defining the “proper” matrix is still an active area of research

• Usually, start from known, reliable alignments. Compute fi, the frequency of occurrence of residue type i, and qij, the probability that residue types i and j are aligned; the score is computed as:

$$
S_{ij} = \log \frac{q_{ij}}{f_i \, f_j}
$$
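As an illustration of the log-odds score S_ij = log(q_ij / (f_i f_j)), the sketch below derives scores from a handful of aligned columns. The columns are made up for the example, the natural logarithm is an arbitrary choice of base, and q_ij is symmetrized so that f_i is simply its marginal; all three are assumptions, as is the name `log_odds`.

```python
import math
from collections import Counter

# Hypothetical aligned columns: (residue in sequence 1, residue in sequence 2).
columns = [("A", "A"), ("A", "A"), ("A", "S"), ("S", "S"), ("S", "A"),
           ("L", "L"), ("L", "L"), ("L", "I"), ("I", "I"), ("I", "L")]


def log_odds(columns):
    """Return {(i, j): log(q_ij / (f_i * f_j))} for every observed residue pair."""
    q = Counter()
    for a, b in columns:  # symmetrize: each column counts half as (a,b), half as (b,a)
        q[(a, b)] += 0.5 / len(columns)
        q[(b, a)] += 0.5 / len(columns)
    f = Counter()
    for (a, _), prob in q.items():  # f_i is the marginal of q_ij
        f[a] += prob
    return {pair: math.log(prob / (f[pair[0]] * f[pair[1]])) for pair, prob in q.items()}


scores = log_odds(columns)
for pair in sorted(scores):
    print(pair, round(scores[pair], 2))
```

Pairs that never occur in the sample (here, for instance, A aligned with L) get no score at all; real matrices are built from far larger alignment sets.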


C S T P A G N D E Q H R K M I L V F Y W

C 9 -1 -1 -3 0 -3 -3 -3 -4 -3 -3 -3 -3 -1 -1 -1 -1 -2 -2 -2

S -1 4 1 -1 1 0 1 0 0 0 -1 -1 0 -1 -2 -2 -2 -2 -2 -3

T -1 1 4 1 -1 1 0 1 0 0 0 -1 0 -1 -2 -2 -2 -2 -2 -3

P -3 -1 1 7 -1 -2 -1 -1 -1 -1 -2 -2 -1 -2 -3 -3 -2 -4 -3 -4

A 0 1 -1 -1 4 0 -1 -2 -1 -1 -2 -1 -1 -1 -1 -1 -2 -2 -2 -3

G -3 0 1 -2 0 6 -2 -1 -2 -2 -2 -2 -2 -3 -4 -4 0 -3 -3 -2

N -3 1 0 -2 -2 0 6 1 0 0 -1 0 0 -2 -3 -3 -3 -3 -2 -4

D -3 0 1 -1 -2 -1 1 6 2 0 -1 -2 -1 -3 -3 -4 -3 -3 -3 -4

E -4 0 0 -1 -1 -2 0 2 5 2 0 0 1 -2 -3 -3 -3 -3 -2 -3

Q -3 0 0 -1 -1 -2 0 0 2 5 0 1 1 0 -3 -2 -2 -3 -1 -2

H -3 -1 0 -2 -2 -2 1 1 0 0 8 0 -1 -2 -3 -3 -2 -1 2 -2

R -3 -1 -1 -2 -1 -2 0 -2 0 1 0 5 2 -1 -3 -2 -3 -3 -2 -3

K -3 0 0 -1 -1 -2 0 -1 1 1 -1 2 5 -1 -3 -2 -3 -3 -2 -3

M -1 -1 -1 -2 -1 -3 -2 -3 -2 0 -2 -1 -1 5 1 2 -2 0 -1 -1

I -1 -2 -2 -3 -1 -4 -3 -3 -3 -3 -3 -3 -3 1 4 2 1 0 -1 -3

L -1 -2 -2 -3 -1 -4 -3 -4 -3 -2 -3 -2 -2 2 2 4 3 0 -1 -2

V -1 -2 -2 -2 0 -3 -3 -3 -2 -2 -3 -3 -2 1 3 1 4 -1 -1 -3

F -2 -2 -2 -4 -2 -3 -3 -3 -3 -3 -1 -3 -3 0 0 0 -1 6 3 1

Y -2 -2 -2 -3 -2 -3 -2 -3 -2 -1 2 -2 -2 -1 -1 -1 -1 3 7 2

W -2 -3 -3 -4 -3 -2 -4 -4 -3 -2 -2 -3 -3 -1 -3 -2 -3 1 2 11

Example of a Scoring matrix

Gap penalty

• Most common model:

WN = G0 + N * G1

WN : gap penalty for a gap of size N

G0 : cost of opening a gap

G1 : cost of extending the gap by one

N : size of the gap

Global versus Local Alignment

• Global alignment finds best arrangement that maximizes total score

• Local alignment identifies highest scoring subsequences, sometimes at the expense of the overall score

Local alignment algorithm is just a variation of the global alignment algorithm!


Modifications for local alignment

1) The scoring matrix has negative values for mismatches

2) The minimum score for any (i,j) in the alignment matrix is 0.

3) The best score is found anywhere in the filled alignment matrix

These 3 modifications cause the algorithm to search for matching sub-sequences which are not penalized by other regions (modif. 2), with minimal poor matches (modif 1), which can occur anywhere (modif 3).

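Mathematical Formulation

Local alignment (Smith-Waterman):

$$
A(i,j) = \max \left\{ 0,\;
\mathrm{Score}(i,j) + \max
\begin{cases}
A(i-1,\,j-1), \\
\max\limits_{1 \le k \le j-2} \left[ A(i-1,\,j-1-k) - W_k \right], \\
\max\limits_{1 \le k \le i-2} \left[ A(i-1-k,\,j-1) - W_k \right]
\end{cases}
\right\}
$$

Wk: penalty for a gap of size k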

Global versus Local Alignment

Match: +1; Mismatch: -2; Gap: -1

Local alignment matrix:

      A   C   C   T   G   S
  A   1   0   0   0   0   0
  C   0   2   1   0   0   0
  C   0   1   3   0   0   0
  N   0   0   0   1   0   0
  S   0   0   0   0   0   1

Global alignment matrix:

      A   C   C   T   G   S
  A   1  -3  -3  -3  -3  -3
  C  -3   2   1  -2  -2  -2
  C  -3   1   3  -1  -1  -1
  N  -3  -2  -1   1   0   0
  S  -3  -2  -1   0  -1   1

Global:

ACCTGS
ACC-NS

Local:

ACC
ACC
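The local variant can be sketched by applying the three modifications to the global fill: negative mismatch scores, a floor of 0 on every cell, and a traceback that starts from the best cell anywhere and stops where the score falls to 0. The sketch assumes a gap penalty of 1 for any gap size, which is consistent with the tables above; the name `local_align` is illustrative.

```python
def local_align(cols, rows, score, gap):
    """Smith-Waterman style fill and traceback; returns (matrix, score, top, bottom)."""
    n, m = len(rows), len(cols)
    A = [[0] * m for _ in range(n)]
    ptr = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            best, arg = 0, None
            if i > 0 and j > 0:
                cands = [(A[i - 1][j - 1], (i - 1, j - 1))]
                cands += [(A[i - 1][j - 1 - k] - gap(k), (i - 1, j - 1 - k)) for k in range(1, j)]
                cands += [(A[i - 1 - k][j - 1] - gap(k), (i - 1 - k, j - 1)) for k in range(1, i)]
                best, arg = max(cands, key=lambda c: c[0])
                if best <= 0:
                    best, arg = 0, None  # nothing worth extending: start a new alignment
            value = score(rows[i], cols[j]) + best
            if value > 0:                # modification 2: scores never drop below 0
                A[i][j], ptr[i][j] = value, arg

    # Modification 3: the best score may be anywhere in the matrix.
    cell = max(((i, j) for i in range(n) for j in range(m)), key=lambda c: A[c[0]][c[1]])
    best_score, path = A[cell[0]][cell[1]], []
    while cell is not None:
        path.append(cell)
        cell = ptr[cell[0]][cell[1]]
    path.reverse()

    top, bottom = "", ""
    pi, pj = path[0][0] - 1, path[0][1] - 1
    for i, j in path:
        top += cols[pj + 1:j] + "-" * (i - pi - 1) + cols[j]
        bottom += "-" * (j - pj - 1) + rows[pi + 1:i] + rows[i]
        pi, pj = i, j
    return A, best_score, top, bottom


match = lambda a, b: 1 if a == b else -2
A, best, top, bottom = local_align("ACCTGS", "ACCNS", match, gap=lambda k: 1)
for residue, row in zip("ACCNS", A):
    print(residue, row)
print(best, top, bottom)
```

On the two sequences of this slide the sketch returns the ACC/ACC segment, leaving the poorly matching tails out of the alignment.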


Heuristic methods

• O(NM) is too slow for database search

• Heuristic methods based on frequency of shared subsequences

• Usually look for ungapped small sequences

FASTA, BLAST

FASTA

• Create hash table of short words of the query sequence (from 2 to 6 characters)

• Scan database and look for matches in the query hash table

• Extend good matches empirically

[Figure: table with rows Word1, Word2, Word3, ..., WordP and columns Seq1, Seq2, ..., SeqN]
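The first two steps can be sketched with a dictionary as the hash table. The sketch indexes every word of length k in the query, then scans each database sequence and counts word matches per diagonal (offset between the two positions); many hits on one diagonal suggest a region worth extending. The toy database and the names `index_words` and `diagonal_hits` are made up for illustration, and the extension step is not shown.

```python
from collections import Counter, defaultdict


def index_words(query, k):
    """Hash table: word of length k -> list of its start positions in the query."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return table


def diagonal_hits(table, subject, k):
    """Count word matches per diagonal (subject position minus query position)."""
    hits = Counter()
    for spos in range(len(subject) - k + 1):
        for qpos in table.get(subject[spos:spos + k], ()):
            hits[spos - qpos] += 1
    return hits


query = "MCYTSWGC"
database = {"seq1": "AAMCYTSWGCAA", "seq2": "GGGGGGGG"}  # hypothetical sequences
table = index_words(query, k=2)
for name, subject in database.items():
    print(name, dict(diagonal_hits(table, subject, k=2)))
```

The query table is built once and reused for every database sequence, so the scan costs one lookup per database word rather than a full alignment.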

BLAST

Tutorial: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

1) Break query sequence and database sequences into words

2) Search for matches (even imperfect ones) that score at least T

3) Extend matches, and look for alignments that score at least S


Summary

• Dynamic programming finds the optimal alignment between two sequences in a computing time proportional to NxM, where N and M are the sequence lengths

• Critical user choices are the scoring matrix, the gap penalties, and the algorithm (local or global)


Statistics of Sequence Alignment


Significance

• We have found that the score of the alignment between two sequences is S.

Question: What is the “significance” of this score?

• Otherwise stated, what is the probability P that the alignment of two random sequences has a score at least equal to S ?

• P is the P-value, and is considered a measure of statistical significance.

If P is small, the initial alignment is significant.

Basic Statistics: Binomial Distribution

A given experiment may yield the event A or the event not(A) with probabilities p and q=1-p, respectively. If the experiment is repeated N times and X is the number of times A is observed, then the probability that X takes the value k is given by:

$$
p(X = k) = \binom{N}{k} \, p^{k} (1-p)^{N-k}
$$

With the binomial coefficient:

$$
\binom{N}{k} = \frac{N!}{k! \, (N-k)!}
$$

Properties:

E(X)=Np

Var(X)=Np(1-p)

[Figure: binomial distribution for N=40; p=0.2]
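The two properties can be checked numerically for the plotted case N=40, p=0.2. This is a small sketch using only the standard library; the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of observing the event exactly k times in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


n, p = 40, 0.2
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

# These should be close to Np = 8 and Np(1-p) = 6.4, up to rounding error.
print(mean, var)
```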

Basic Statistics: Poisson Distribution

The Poisson distribution is a discrete distribution usually defined over a volume or time interval.

Given a process with expected number of successes λ in the given interval, the probability of observing exactly X successes is given by:

$$
p(X) = \frac{\lambda^{X} \exp(-\lambda)}{X!}
$$

Properties:

E(X)=λ

Var(X)=λ

[Figure: Poisson distribution for λ=8]

Basic Statistics: Normal Distribution

A normal distribution in a variable X with mean μ and variance σ² is a statistical distribution with probability function:

$$
p(X) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(X-\mu)^2}{2\sigma^2} \right)
$$

The normal distribution is the limiting case of a binomial distribution P(n,N,p) with:

$$
\mu = Np, \qquad \sigma^2 = Np(1-p)
$$

[Figure: normal distribution compared with the binomial distribution for N=40; p=0.2]

Basic Statistics: Extreme Value Distribution

An extreme value distribution in a variable X is a statistical distribution with probability function:

$$
p(X) = \frac{1}{b} \exp\left( -\frac{X-a}{b} \right) \exp\left[ -\exp\left( -\frac{X-a}{b} \right) \right]
$$

[Figure: extreme value distribution for a=0; b=1, plotted from -10 to 10]

Note the presence of a long tail
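The long tail can be made concrete by comparing the two sides of the distribution. The sketch assumes the form p(X) = (1/b) exp(-(X-a)/b) exp(-exp(-(X-a)/b)), whose upper tail is P(X >= x) = 1 - exp(-exp(-(x-a)/b)); the function names are illustrative.

```python
import math


def evd_pdf(x, a=0.0, b=1.0):
    """Density of the extreme value distribution with location a and scale b."""
    z = (x - a) / b
    return math.exp(-z) * math.exp(-math.exp(-z)) / b


def evd_upper_tail(x, a=0.0, b=1.0):
    """P(X >= x) = 1 - exp(-exp(-(x - a) / b))."""
    return 1.0 - math.exp(-math.exp(-(x - a) / b))


def evd_lower_tail(x, a=0.0, b=1.0):
    """P(X <= x) = exp(-exp(-(x - a) / b))."""
    return math.exp(-math.exp(-(x - a) / b))


# Four scale units above the mode is rare but plausible;
# four units below it is essentially impossible.
print(evd_upper_tail(4.0), evd_lower_tail(-4.0))
```

This asymmetry is what matters for alignment scores: unusually high scores occur by chance far more often than a symmetric bell curve would predict.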


BLAST Input


BLAST results


BLAST results (2)

Statistics of Protein Sequence Alignment

• Statistics of global alignment:

Unfortunately, not much is known! Statistics are based on Monte Carlo simulations (shuffle one sequence and recompute the alignment to get a distribution of scores)

• Statistics of local alignment:

Well understood for ungapped alignment. The same theory probably applies to gapped alignment
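The Monte Carlo procedure for global alignment can be sketched as follows. To keep the example short it scores alignments with the standard textbook Needleman-Wunsch recurrence and a per-residue gap cost, which is a swapped-in simplification; the scoring values, the number of shuffles, and the add-one estimate of the P-value are all assumptions.

```python
import random


def global_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Global alignment score (textbook Needleman-Wunsch, linear gap cost)."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        cur = [i * gap]
        for j, b in enumerate(s2, start=1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_pvalue(s1, s2, trials=200, seed=0):
    """Estimate P(score of s1 vs shuffled s2 >= observed score)."""
    rng = random.Random(seed)
    observed = global_score(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # keeps the composition, destroys the order
        if global_score(s1, "".join(letters)) >= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)  # add-one estimate, never exactly 0


print(shuffle_pvalue("MCYTSWGCMCYTSWGC", "MCYTSWGCMCYTSWGC"))
```

Shuffling preserves the amino acid composition of the sequence, so the resulting score distribution reflects what composition alone can produce.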


Statistics of Protein Sequence Alignment

What is a local alignment ?

“Pair of equal length segments, one from each sequence, whose scores can not be improved by extension or trimming. These are called high-scoring pairs, or HSP”

http://www.people.virginia.edu/~wrp/cshl98/Altschul/Altschul-1.html

The E-value for a sequence alignment

HSP scores follow an extreme value distribution, characterized by two parameters, K and λ.

The expected number of HSPs with score at least S is given by:

$$
E = K \, m \, n \, \exp(-\lambda S)
$$

m, n : sequence lengths

E : E-value

[Figure: extreme value distribution of HSP scores, with the score S marked on the horizontal axis]

The Bit Score of a sequence alignment

Raw scores have little meaning without knowledge of the scoring scheme used for the alignment, or equivalently of the parameters K and λ. Scores can be normalized according to:

$$
S' = \frac{\lambda S - \ln K}{\ln 2}
$$

S’ is the bit score of the alignment.

The E-value can be expressed as:

$$
E = m \, n \, 2^{-S'}
$$
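A quick numerical check shows that the raw-score and bit-score forms of the E-value agree, that is E = K m n exp(-λS) = m n 2^(-S') with S' = (λS - ln K) / ln 2. The values of K and λ below are illustrative placeholders of a typical order of magnitude, not parameters taken from these notes, and the sequence lengths and raw score are made up.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """Expected number of HSPs with score >= raw_score: K m n exp(-lambda S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score S' = (lambda S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """Same E-value written with the bit score: m n 2^(-S')."""
    return m * n * 2.0 ** (-bits)


K, lam = 0.134, 0.318      # assumed, illustrative parameter values
m, n, S = 250, 300, 60     # hypothetical sequence lengths and raw score

bits = bit_score(S, K, lam)
print(bits, e_value(S, m, n, K, lam), e_value_from_bits(bits, m, n))
```

Once a score is expressed in bits, K and λ no longer appear in the E-value, which is what makes bit scores comparable across scoring schemes.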

The P-value of a sequence alignment

The number of random HSPs with score greater than or equal to S follows a Poisson distribution:

$$
P(X \text{ random HSPs with score} \ge S) = \frac{E^{X} \exp(-E)}{X!}
$$

(E: E-value)

Then:

$$
P(0 \text{ random HSP with score} \ge S) = \exp(-E)
$$

$$
P_{val} = P(\text{at least 1 random HSP with score} \ge S) = 1 - \exp(-E)
$$

Note: when E << 1, P ≈ E
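The relation between E-value and P-value is easy to tabulate. The sketch below uses the Poisson form P(X) = E^X exp(-E) / X! and P-value = 1 - exp(-E); the function names are illustrative.

```python
import math


def poisson_pmf(x, E):
    """Probability of exactly x random HSPs when E are expected."""
    return E**x * math.exp(-E) / math.factorial(x)


def p_value(E):
    """P(at least one random HSP) = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-E)


for E in (10.0, 1.0, 0.1, 0.001):
    print(E, p_value(E))
```

For small E the two numbers are nearly identical, while for large E the P-value saturates at 1 and the E-value remains the more informative quantity.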

The database E-value for a sequence alignment

Database search, where the database contains NS sequences corresponding to NR residues:

1) All sequences are a priori equally likely to be related to the query:

$$
E_{DB} = N_S \, K \, m \, n \, \exp(-\lambda S)
$$

2) Longer sequences are more likely to be related to the query:

$$
E_{DB2} = K \, m \, N_R \, \exp(-\lambda S)
$$

BLAST reports EDB2
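The two database corrections can be compared on a toy database. The sketch assumes m is the query length and n the length of the matched database sequence; the values of K and λ, the raw score and the database sizes are illustrative placeholders, and the function names are made up.

```python
import math


def e_db_equal_prior(S, m, n, n_seqs, K, lam):
    """Option 1: pairwise E-value multiplied by the number of sequences N_S."""
    return n_seqs * K * m * n * math.exp(-lam * S)


def e_db_length_prior(S, m, n_residues, K, lam):
    """Option 2: the database is treated as one long sequence of N_R residues."""
    return K * m * n_residues * math.exp(-lam * S)


K, lam = 0.134, 0.318                        # assumed, illustrative parameter values
S, m = 60, 250                               # hypothetical raw score and query length
n_seqs, n_residues = 100_000, 35_000_000     # hypothetical database size

for n in (100, 350, 2000):                   # length of the matched database sequence
    print(n, e_db_equal_prior(S, m, n, n_seqs, K, lam),
          e_db_length_prior(S, m, n_residues, K, lam))
```

The two options agree when the matched sequence has the average database length N_R / N_S; option 1 gives a smaller E-value for shorter matches and a larger one for longer matches.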

Summary

• Statistics on local sequence alignment are defined by:

- Raw score

- Bit score (normalized score)

- E-value

- P-value

• Statistics on database search:

- database E-value