Computational Genomics (0382.3102)

Lecture 3

Sequence Similarity and Pairwise Alignment II:

Affine Gap Penalties, Local Alignment,

BLAST and FASTA Heuristics

Prof. Benny Chor

School of Computer Science

Tel-Aviv University

Based in part on chapter *** in Gusfield’s book, chapter 3 in Kanehisa’s book,

and on a ppt presentation by Terry Speed (UC Berkeley)

© Benny Chor

Distances

Let $X$ be a (finite or infinite) set. A distance on $X$ is a function $d : X \times X \to \mathbb{R}$ satisfying the following three properties:

• Symmetry: $\forall x, y \in X$, $d(x, y) = d(y, x)$.

• Non-negativity: $\forall x, y \in X$, $d(x, y) \ge 0$, and $d(x, y) = 0$ if and only if $x = y$.

• Triangle inequality: $\forall x, y, z \in X$, $d(x, z) \le d(x, y) + d(y, z)$.
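The three properties can be checked mechanically on any finite sample of points. The following is a minimal sketch, not part of the lecture: the helper name `is_distance` and the two candidate functions are illustrative assumptions. It confirms that the absolute difference on numbers passes all three tests, while the squared difference fails the triangle inequality.

```python
from itertools import product


def is_distance(points, d, tol=1e-12):
    """Check the three distance properties for d on a finite set of points.

    Hypothetical helper for illustration: passing on a finite sample does
    not prove that d is a distance on a larger (or infinite) set.
    """
    pts = list(points)
    for x, y in product(pts, repeat=2):
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
        if d(x, y) < -tol:                      # non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):        # d(x, y) = 0 iff x = y
            return False
    for x, y, z in product(pts, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True


def abs_diff(x, y):
    return abs(x - y)


def sq_diff(x, y):
    # Symmetric and non-negative, but d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2.
    return (x - y) ** 2


print(is_distance([0, 1, 2, 5], abs_diff))  # True
print(is_distance([0, 1, 2], sq_diff))      # False
```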

Famous Distances

A distance on $X$, $d : X \times X \to \mathbb{R}$, is also called a metric in math jargon; below we loosely call it a norm. Examples of norms (for some of these it is not immediate to verify that the triangle inequality holds):

• $d(x, y) = 1$ if $x \neq y$, and $d(x, x) = 0$ (this norm is a bit boring).

• Let $X = \mathbb{R}^n$ ($n$-dim. real vectors) and $p \ge 1$. Define

$$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}.$$

In math jargon, this is known as the $L_p$ norm.
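Both examples are easy to compute directly. The sketch below is illustrative only; the function names `discrete_distance` and `lp_distance` are assumptions, and vectors are represented as plain Python tuples.

```python
def discrete_distance(x, y):
    """The 'boring' distance: 0 for equal objects, 1 otherwise."""
    return 0 if x == y else 1


def lp_distance(x, y, p):
    """L_p distance between two real vectors of equal dimension, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    if p < 1:
        raise ValueError("p must be at least 1 for the triangle inequality to hold")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


print(discrete_distance("ACGT", "ACGA"))  # 1
print(lp_distance((0, 0), (3, 4), 2))     # 5.0
print(lp_distance((0, 0), (3, 4), 1))     # 7.0
```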

More Famous Distances

• $L_p$ with $p = 2$ is the "regular" Euclidean distance.

• $L_p$ with $p = 1$ (sum of absolute values of differences).

• The $L_\infty$ norm: $d_\infty(x, y) = \max_i |x_i - y_i|$ (the limit of $L_p$ as $p \to \infty$).

• Let $G = (V, E)$ be a finite, undirected, connected graph, with positive edge lengths. Let $u, v \in V$ be a pair of vertices.

• Define $d(u, v)$ = the length of the shortest path from $u$ to $v$ in $G$. This is a distance on $V$.
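As a hedged illustration of the last two items, the sketch below computes the maximum-coordinate distance, shows numerically that the $L_p$ distance approaches it for large $p$, and computes shortest-path distances with Dijkstra's algorithm. The small graph, its edge lengths, and all function names are made up for the example.

```python
import heapq


def lp_distance(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def linf_distance(x, y):
    """Largest coordinate-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm on a graph given as {vertex: {neighbor: length}}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d_u + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical undirected graph: every edge is listed in both directions.
graph = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2, "d": 5},
    "c": {"a": 4, "b": 2, "d": 1},
    "d": {"b": 5, "c": 1},
}
d = {u: shortest_path_lengths(graph, u) for u in graph}

print(linf_distance((0, 0), (3, 4)))   # 4
print(lp_distance((0, 0), (3, 4), 50))  # very close to 4
print(d["a"]["d"], d["d"]["a"])        # 4 4
```

Symmetry of the graph distance relies on the graph being undirected, and $d(u, v) > 0$ for $u \neq v$ relies on the edge lengths being positive.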

Distance vs. Similarity

• Distance and similarity are dual notions. If $x$ and $y$ are highly similar objects, then intuitively they have small distance.

• This intuition indeed holds for pairwise global sequence alignment (see prob. 6 in assignment 1).

• We can replace sequence similarity by distance and obtain qualitatively similar results for pairwise global sequence alignment.

• Dynamic programming can be used to find a minimum distance alignment in time $O(nm)$, where $n$ and $m$ are the lengths of the two sequences.
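A minimal sketch of the minimum-distance dynamic program, assuming unit costs for mismatches and indels (the function name and the default costs are illustrative assumptions; the lecture does not fix a particular cost scheme here). The table has $(n+1)(m+1)$ entries and each is filled in constant time, which gives the $O(nm)$ bound.

```python
def alignment_distance(s, t, mismatch=1, indel=1):
    """Minimum total cost of a global alignment of s and t."""
    n, m = len(s), len(t)
    # D[i][j] = minimum distance between the i-prefix of s and the j-prefix of t.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel          # align a prefix of s against nothing
    for j in range(1, m + 1):
        D[0][j] = j * indel          # align nothing against a prefix of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = min(
                D[i - 1][j - 1] + sub,   # match or mismatch
                D[i - 1][j] + indel,     # s[i-1] aligned to a space
                D[i][j - 1] + indel,     # t[j-1] aligned to a space
            )
    return D[n][m]


print(alignment_distance("kitten", "sitting"))  # 3
```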

Local Sequence Alignment

• In local sequence alignment, we have two input sequences (strings): the query $S$, and the text $T$.

• The goal is to find two substrings: $\alpha$, a substring of $S$, and $\beta$, a substring of $T$, such that the (global) alignment score between $\alpha$ and $\beta$ is maximized (over all choices of pairs of substrings).

• Notice that if $\alpha'$, $\beta'$ is an optimal alignment of $\alpha$ and $\beta$, then $\alpha'$, $\beta'$ can contain indels.
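Taken literally, this definition suggests a brute-force search over all pairs of substrings. The sketch below does exactly that, assuming a simple match/mismatch/indel scoring scheme; the function names and the default values +2, -1, -2 are illustrative assumptions. It is far too slow for real data and serves only to make the objective concrete.

```python
def global_score(a, b, match=2, mismatch=-1, indel=-2):
    """Optimal global alignment score of a and b (score only, two rows)."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


def brute_force_local(s, t, **scoring):
    """Try every pair of non-empty substrings; keep the best global score.

    Returns (score, substring of s, substring of t). The empty pair with
    score 0 is the baseline, so the result is never negative.
    """
    best = (0, "", "")
    for i in range(len(s)):
        for k in range(i + 1, len(s) + 1):
            for j in range(len(t)):
                for l in range(j + 1, len(t) + 1):
                    score = global_score(s[i:k], t[j:l], **scoring)
                    if score > best[0]:
                        best = (score, s[i:k], t[j:l])
    return best


# Hypothetical toy strings sharing only the block "abc".
print(brute_force_local("xxabcxx", "yyabcyy"))  # (6, 'abc', 'abc')
```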

Local Alignment DP Algorithm

• The global and local alignment problems seem very different. But a rather small change in the "global" DP algorithm yields an efficient $O(nm)$ "local" DP algorithm.

• Goal: Fill the $n \times m$ matrix by values $V(i, j)$, the value of the best (global) alignment between all suffixes of the $i$-prefix of $S$ and suffixes of the $j$-prefix of $T$.

Local Alignment DP Algorithm

• Initialize all boundary values (upper row, left column) to $0$.

• Update rule:

$$V(i, j) = \max \begin{cases} 0 \\ V(i-1, j-1) + \sigma(S_i, T_j) \\ V(i-1, j) + \sigma(S_i, -) \\ V(i, j-1) + \sigma(-, T_j) \end{cases}$$

where $\sigma$ scores a pair of aligned characters and $-$ denotes a space.

• Keep pointers like before (no pointer if $V(i, j) = 0$).

• Pick the highest entry in the matrix. Trace back to recover optimal local alignment(s).
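The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: a fixed match/mismatch/indel scoring scheme with default values +2, -1, -2, a hypothetical function name, and a traceback that returns only one optimal local alignment even when several exist.

```python
def smith_waterman(s, t, match=2, mismatch=-1, indel=-2):
    """Local alignment by dynamic programming.

    Returns (best score, aligned part of s, aligned part of t),
    with '_' marking a space.
    """
    n, m = len(s), len(t)
    # Boundary row and column stay at 0, with no pointers.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (0, None),                       # start a fresh alignment
                (V[i - 1][j - 1] + sub, "diag"),
                (V[i - 1][j] + indel, "up"),     # s[i-1] against a space
                (V[i][j - 1] + indel, "left"),   # t[j-1] against a space
            ]
            # max() keeps the first maximal candidate, so a value of 0
            # always comes with no pointer.
            V[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)
    # Trace back from the highest entry until a cell without a pointer.
    i, j = best_pos
    top, bottom = [], []
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diag":
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(s[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical toy strings sharing only the block "abc".
print(smith_waterman("xxabcxx", "yyabcyy"))  # (6, 'abc', 'abc')
```

Compared with the global algorithm, the only changes are the extra $0$ candidate, the zero boundary, and starting the traceback at the maximum entry rather than at the bottom-right corner.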

Local Sequence Alignment II

For example, suppose our sequences are […] and […].

Our scoring gives +2 for a match, -1 for a mismatch, and -2 for an indel.

A reasonable (best?) choice would be the substrings […] and […], with the (global) alignment between them being

[…]

In the context of the original sequences, we have […] and […].
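To make the +2 / -1 / -2 scoring concrete, the sketch below scores a given gapped alignment column by column. The two aligned strings are hypothetical stand-ins chosen for illustration, not the sequences of the example above, and the function name is an assumption.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, indel=-2, gap="_"):
    """Sum the column scores of an alignment written as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two spaces")
        if a == gap or b == gap:
            total += indel
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Hypothetical alignment: five matching columns and two indels.
print(score_alignment("axab_cs", "ax_bacs"))  # 5*2 + 2*(-2) = 6
```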

Other Versions of Alignment

• Global in $S$, local in $T$.

• Affine gap penalties.

• Linear ($O(n + m)$) space algorithm.
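Of the variants listed, the linear-space idea is the easiest to sketch: each row of the DP matrix depends only on the previous row, so the optimal local score can be computed while storing just two rows. The code below is a minimal sketch under assumed scoring values (+2, -1, -2 as defaults) with a hypothetical function name. It returns only the score; recovering the alignment itself in linear space needs an additional divide-and-conquer step (Hirschberg's technique), which is not shown.

```python
def local_score_linear_space(s, t, match=2, mismatch=-1, indel=-2):
    """Optimal local alignment score using two rows of the DP matrix."""
    if len(t) > len(s):
        s, t = t, s  # scoring is symmetric, so keep the rows as short as possible
    prev = [0] * (len(t) + 1)   # boundary row of zeros
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)  # cur[0] is the zero boundary column
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + indel, cur[j - 1] + indel)
            best = max(best, cur[j])
        prev = cur
    return best


# Hypothetical toy strings sharing only the block "abc".
print(local_score_linear_space("xxabcxx", "yyabcyy"))  # 6
```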