cecs 694-02 introduction to bioinformatics university of louisville spring 2003 dr. eric rouchka...

104
CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka, D.Sc. [email protected] http://kbrin.kwing.louisville.edu/~rouchka/ CECS694/

Post on 20-Dec-2015

217 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Lecture 2:

Pairwise Sequence Alignment

Eric C. Rouchka, D.Sc.

[email protected]

http://kbrin.kwing.louisville.edu/~rouchka/CECS694/

Page 2: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Sequence Alignment

• Question: Are two sequences related?

• Compare the two sequences, see if they are similar

• Example: pear and tear

• Similar words, different meanings

Page 3: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Biological Sequences

• Similar biological sequences tend to be related

• Information:– Functional– Structural– Evolutionary

Page 4: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Biological sequences

• Common mistake: – sequence similarity is not homology!

• Homologous sequences: derived from a common ancestor

Page 5: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Relation of sequences

• Homologs: similar sequences in 2 different organisms derived from a common ancestor sequence.

• Orthologs: Similar sequences in 2 different organisms that have arisen due to a speciation event. Functionality Retained.

Page 6: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Relation of sequences

• Paralogs: Similar sequences within a single organism that have arisen due to a gene duplication event.

• Xenologs: similar sequences that have arisen out of horizontal transfer events (symbiosis, viruses, etc)

Page 7: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Relation of sequences

Image Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Page 8: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Edit Distance

• Sequence similarity: function of edit distance between two sequences

P E A R

| | |

T E A R

Page 9: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Hamming Distance

• Minimum number of letters by which two words differ

• Calculated by summing number of mismatches

• Hamming Distance between PEAR and TEAR is 1

Page 10: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Gapped Alignments

• Biological sequences– Different lengths– Regions of insertions and deletions

• Notion of gaps (denoted by ‘-’)

A L I G N M E N T | | | | | | |- L I G A M E N T

Page 11: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Possible Residue Alignments

• Match

• Mismatch (substitution or mutation)

• Insertion/Deletion (INDELS – gaps)

Page 12: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Alignments

• Which alignment is best?

A – C – G G – A C T| | | | |A T C G G A T _ C T

 A T C G G A T C T| | | | | |A – C G G – A C T

Page 13: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Alignment Scoring Scheme

• Possible scoring scheme:

match: +2

mismatch: -1

indel –2

• Alignment 1: 5 * 2 – 1(1) – 4(2) = 10 – 1 – 8 = 1

• Alignment 2: 6 * 2 – 1(1) – 2 (2) = 12 – 1 – 4 = 7

Page 14: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Alignment Methods

• Visual

• Brute Force

• Dynamic Programming

• Word-Based (k tuple)

Page 15: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Visual Alignments (Dot Plots)

• Matrix– Rows: Characters in one sequence– Columns: Characters in second sequence

• Filling– Loop through each row; if character in row, col

match, fill in the cell– Continue until all cells have been examined

Page 16: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Example Dot Plot

Page 17: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Noise in Dot Plots

• Nucleic Acids (DNA, RNA)– 1 out of 4 bases matches at random

• Stringency– Window size is considered– Percentage of bases matching in the

window is set as threshold

Page 18: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Reduction of Dot Plot Noise

Self alignment of ACCTGAGCTCACCTGAGTTA

Page 19: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Information Inside Dot Plots

• Regions of similarity: diagonals

• Insertions/deletions– Can determine intron/exon structure

• Repeats and Inverted Repeats– Inverted repeats = reverse complement– Used to determine folding of RNA molecules

Page 20: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Insertions/Deletions

Page 21: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Repeats/Inverted Repeats

Page 22: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Comparing Genome Assemblies

Page 23: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Chromosome Y self comparison

Page 24: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Available Dot Plot Programs

• Vector NTI software package (under AlignX)

Page 25: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Available Dot Plot Programs

• Vector NTI software package (under AlignX)

GCG software package:• Compare http://www.hku.hk/bruhk/gcgdoc/compare.html

• DotPlot+ http://www.hku.hk/bruhk/gcgdoc/dotplot.html • http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html• http://bioweb.pasteur.fr/cgi-bin/seqanal/dottup.pl • Dotter (http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html)

Page 26: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Available Dot Plot Programs

Dotlet (Java Applet) http://www.isrec.isb-sib.

ch/java/dotlet/Dotlet.html

Page 27: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Available Dot Plot ProgramsDotter (http://www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html)

Page 28: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Available Dot Plot ProgramsEMBOSS DotMatcher, DotPath,DotUp

Page 29: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Available Dot Plot Programs

GCG software package:• Compare http://www.hku.hk/bruhk/gcgdoc

/compare.html

• DotPlot+ http://www.hku.hk/bruhk/gcgdoc/dotplot.html 

• DNA strider• PipMaker

Page 30: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Dot Plot References

Gibbs, A. J. & McIntyre, G. A. (1970). The diagram method for comparing sequences. its use with amino acid and nucleotide sequences. Eur. J. Biochem. 16, 1-11.

 

 

Staden, R. (1982). An interactive graphics program for comparing and aligning nucleic-acid and amino-acid sequences. Nucl. Acid. Res. 10 (9), 2951-2961.

Page 31: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Determining Optimal Alignment

• Two sequences: X and Y– |X| = m; |Y| = n– Allowing gaps, |X| = |Y| = m+n

• Brute Force

• Dynamic Programming

Page 32: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Brute Force

• Determine all possible subsequences for X and Y– 2m+n subsequences for X, 2m+n for Y!

• Alignment comparisons– 2m+n * 2m+n = 2(2(m+n)) = 4m+n comparisons

• Quickly becomes impractical

Page 33: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Dynamic Programming

• Used in Computer Science

• Solve optimization problems by dividing the problem into independent subproblems

Page 34: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Dynamic Programming

• Sequence alignment has optimal substructure property– Subproblem: alignment of prefixes of two

sequences

– Each subproblem is computed once and stored in a matrix

Page 35: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Dynamic Programming

• Optimal score: built upon optimal alignment computed to that point

• Aligns two sequences beginning at ends, attempting to align all possible pairs of characters

Page 36: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Dynamic Programming

• Scoring scheme for matches, mismatches, gaps

• Highest set of scores defines optimal alignment between sequences

• Match score: DNA – exact match; Amino Acids – mutation probabilities

Page 37: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Dynamic Programming

• Guaranteed to provide optimal alignment given:– Two sequences– Scoring scheme

Page 38: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Steps in Dynamic Programming

        Initialization        Matrix Fill (scoring)        Traceback (alignment)

Page 39: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

DP Example

Sequence #1: GAATTCAGTTA; M = 11

Sequence #2: GGATCGA; N = 7

 

        s(aibj) = +5 if ai = bj (match score)

        s(aibj) = -3 if aibj (mismatch score)

        w = -4 (gap penalty)

Page 40: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

View of the DP Matrix

• M+1 rows, N+1 columns

Page 41: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Global Alignment(Needleman-Wunsch)

• Attempts to align all residues of two sequences

• INITIALIZATION: First row and first column set

• Si,0 = w * i

• S0,j = w * j

Page 42: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Initialized Matrix (Needleman-Wunsch)

Page 43: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill(Global Alignment)

Si,j = MAXIMUM[

Si-1, j-1 + s(ai,bj) (match/mismatch in the

diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2)

Page 44: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill (Global Alignment)

• S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 - 4] = MAX[5, -8, -8]

Page 45: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill (Global Alignment)

• S1,2 = MAX[S0,1 -3, S1,1 - 4, S0,2 - 4] = MAX[-4 - 3, 5 – 4, -8 – 4] =

MAX[-7, 1, -12] = 1

Page 46: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill (Global Alignment)

Page 47: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Filled Matrix (Global Alignment)

Page 48: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Trace Back (Global Alignment)

• maximum global alignment score = 11 (value in the lower right hand cell).

•  • Traceback begins in position SM,N; i.e. the

position where both sequences are globally aligned.

•  • At each cell, we look to see where we move

next according to the pointers.

Page 49: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Trace Back (Global Alignment)

Page 50: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Global Trace Back

G A A T T C A G T T A

| | | | | |

G G A – T C – G - — A

Page 51: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Checking Alignment Score

G A A T T C A G T T A| | | | | |G G A – T C – G - — A

 + - + - + + - + - - +5 3 5 4 5 5 4 5 4 4 5

 

5 – 3 + 5 – 4 + 5 + 5 – 4 + 5 – 4 – 4 + 5 = 11

Page 52: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Local Alignment

• Smith-Waterman: obtain highest scoring local match between two sequences

• Requires 2 modifications:– Negative scores for mismatches– When a value in the score matrix becomes

negative, reset it to zero (begin of new alignment)

Page 53: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Local Alignment Initialization

• Values in row 0 and column 0 set to 0.

Page 54: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill(Local Alignment)

Si,j = MAXIMUM[

Si-1, j-1 + s(ai,bj) (match/mismatch in the diagonal),

Si,j-1 + w (gap in sequence #1),

Si-1,j + w (gap in sequence #2),

0

]

Page 55: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill(Local Alignment)

• S1,1 = MAX[S0,0 + 5, S1,0 - 4, S0,1 – 4,0] = MAX[5, -4, -4, 0] = 5

Page 56: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill (Local Alignment)

• S1,2 = MAX[S0,1 -3, S1,1 - 4, S0,2 – 4, 0] = MAX[0 - 3, 5 – 4, 0 – 4,

0] = MAX[-3, 1, -4, 0] = 1

Page 57: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Matrix Fill (Local Alignment)S1,3 = MAX[S0,2 -3, S1,2 - 4, S0,3 – 4, 0] = MAX[0 - 3, 1 – 4, 0 – 4, 0] =

MAX[-3, -3, -4, 0] = 0

Page 58: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Filled Matrix(Local Alignment)

Page 59: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Trace Back (Local Alignment)

• maximum local alignment score for the two sequences is 14

• found by locating the highest values in the score matrix

• 14 is found in two separate cells, indicating multiple alignments producing the maximal alignment score

Page 60: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Trace Back (Local Alignment)

• Traceback begins in the position with the highest value.

• At each cell, we look to see where we move next according to the pointers

• When a cell is reached where there is not a pointer to a previous cell, we have reached the beginning of the alignment

Page 61: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Trace Back (Local Alignment)

Page 62: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Trace Back (Local Alignment)

Page 63: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Trace Back (Local Alignment)

Page 64: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Maximum Local Alignment

G A A T T C - A

| | | | |

G G A T – C G A

 

+ - + + - + - +

5 3 5 5 4 5 4 5

 

 

G A A T T C - A

| | | | |

G G A – T C G A

 

+ - + - + + - +

5 3 5 4 5 5 4 5

Page 65: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Scoring Matrices

• match/mismatch score – Not bad for similar sequences– Does not show distantly related sequences

• Likelihood matrix– Scores residues dependent upon likelihood

substitution is found in nature– More applicable for amino acid sequences

Page 66: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Percent Accepted Mutation (PAM or Dayhoff) Matrices

• Studied by Margaret Dayhoff

• Amino acid substitutions – Alignment of common protein sequences– 1572 amino acid substitutions– 71 groups of protein, 85% similar

• “Accepted” mutations – do not negatively affect a protein’s fitness

Page 67: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Percent Accepted Mutation (PAM or Dayhoff) Matrices

• Similar sequences organized into phylogenetic trees

• Number of amino acid changes counted

• Relative mutabilities evaluated

• 20 x 20 amino acid substitution matrix calculated

Page 68: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Percent Accepted Mutation (PAM or Dayhoff) Matrices

• PAM 1: 1 accepted mutation event per 100 amino acids; PAM 250: 250 mutation events per 100 …

• PAM 1 matrix can be multiplied by itself N times to give transition matrices for sequences that have undergone N mutations

• PAM 250: 20% similar; PAM 120: 40%; PAM 80: 50%; PAM 60: 60%

Page 69: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

PAM1 matrixnormalized probabilities multiplied by 10000

Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val A R N D C Q E G H I L K M F P S T W Y V A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18 R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1 N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1 D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1 C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2 Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1 E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2 G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5 H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1 I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15 K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4 F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0 P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2 S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2 T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9 W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0 Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1 V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901

Page 70: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Log Odds Matrices

• PAM matrices converted to log-odds matrix– Calculate odds ratio for each substitution

• Taking scores in previous matrix• Divide by frequency of amino acid

– Convert ratio to log10 and multiply by 10– Take average of log odds ratio for converting A to B and

converting B to A– Result: Symmetric matrix– EXAMPLE: Mount pp. 80-81

Page 71: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

PAM250 Log odds matrix

Page 72: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Blocks Amino Acid Substitution Matrices (BLOSUM)

• Larger set of sequences considered

• Sequences organized into signature blocks

• Consensus sequence formed– 60% identical: BLOSUM 60– 80% identical: BLOSUM 80

Page 73: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Nucleic Acid Scoring Matrices

• Two mutation models:– Uniform mutation rates (Jukes-Cantor)– Two separate mutation rates (Kimura)

• Transitions• Transversions

Page 74: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

DNA Mutations

A

C

G

T

PURINES: A, GPYRIMIDINES C, T

Transitions: AG; CTTransversions: AC, AT, CG, GT

Page 75: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

PAM1 DNA odds matrices

A. Model of uniform mutation rates among nucleotides. A G T C

A 0.99 G 0.00333 0.99 T 0.00333 0.00333 0.99 C 0.00333 0.00333 0.00333 0.99

B. Model of 3-fold higher transitions than transversions. A G T C

A 0.99 G 0.006 0.99 T 0.002 0.002 0.99 C 0.002 0.002 0.006 0.99

Page 76: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

PAM1 DNA log-odds matrices

A. Model of uniform mutation rates among nucleotides.A G T CA 2 G -6 2 T -6 -6 2 C -6 -6 -6 2

 B. Model of 3-fold higher transitions than transversions.A G T C

A 2 G -5 2 T -7 -7 2 C -7 -7 -5 2

Page 77: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Linear vs. Affine Gaps

• Gaps have been modeled as linear

• More likely contiguous block of residues inserted or deleted– 1 gap of length k rather than k gaps of length 1

• Scoring scheme should penalize new gaps more

Page 78: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Affine Gap Penalty

• wx = g + r(x-1)

• wx : total gap penalty; g: gap open penalty; r:

gap extend penalty;x: gap length• • gap penalty chosen relative to score matrix

– Gaps not excluded– Gaps not over included– Typical Values: g=-12; r = -4

Page 79: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Affine Gap Penalty and Dynamic Programming

Mi, j = max {

Di - 1, j - 1 + subst(Ai, Bj)

Mi - 1, j - 1 + subst(Ai, Bj)

Ii - 1, j - 1 + subst(Ai, Bj)

Di, j = max {

Di , j - 1 - extend

Mi , j - 1 - open

Ii, j = max {

Mi-1 , j - open

Ii-1 , j - extend

Page 80: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Drawbacks to DP Approaches

• Compute intensive

• Memory Intensive

• O(n2) space, between O(n2) and O(n3) time

Page 81: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Alternative DP approaches

• Linear space algorithms Myers-Miller

• Bounded Dynamic Programming

• Ewan Birney’s Dynamite Package– Automatic generation of DP code

Page 82: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Significance of Alignment

• Determine probability of alignment occurring at random– Sequence 1: length m– Sequence 2: length n

• Random sequences:– Alignment follows Gumbel Extreme Value

Distribution

Page 83: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Gumbel Extreme Value Distribution

• http://roso.epfl.ch/mbi/papers/discretechoice/node11.html

Page 84: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Probability of Alignment Score

• Expected # of alignments with score at least S (E-value):

E = Kmn e-λS

– m,n: Lengths of sequences– K ,λ: statistical parameters dependent

upon scoring system and background residue frequencies

Page 85: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Converting to Bit Scores

A raw score can be normalized to a bit score using the formula:

 

• The E-value corresponding to a given bit score can then be calculated as:

Page 86: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

P-Value

• P-Value: probability of obtaining a given score at random

P = 1 – e-E 

Which is approximately e-E

Page 87: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Significance of Ungapped Alignments

• PAM matrices are 10 * log10x

• Converting to log2x gives bits of information

• Converting to logex gives nats of information

Page 88: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Quick Calculation

• If bit scoring system is used, significance cutoff is:

log2(mn)

Page 89: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Example (p110)

• 2 Sequences, each 250 amino acids long

• Significance: – log2(250 * 250) = 16 bits

Page 90: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Example (p110)

• Using PAM250, the following alignment is found:

• F W L E V E G N S M T A P T G• F W L D V Q G D S M T A P A G

Page 91: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Example (p110)

• Using PAM250 (p82), the score is calculated:

• F W L E V E G N S M T A P T G• F W L D V Q G D S M T A P A G

• S = 9 + 17 + 6 + 3 + 4 + 2 + 5 + 2 + 2 + 6 + 3 + 2 + 6 + 1 + 5 = 73

Page 92: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Significance Example

• S is in 10 * log10x -- convert to a bit score:•  • S = 10 log10x

• S/10 = log10x

• S/10 = log10x * (log210/log210)

• S/10 * log210 = log10x / log210

• S/10 * log210 = log2x

• 1/3 S ~ log2x

•  • S’ ~ 1/3S

Page 93: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Significance Example

• S’ = 1/3S = 1/3 * 73 = 24.333 bits

• Significance cutoff = 16 bits

• Therefore, this alignment is significant

Page 94: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Estimation of E and P

• For PAM250, K = 0.09; = 0.229

• Using equations 30 and 31:

• S’ = 0.229 * 73 – ln 0.09 * 250 * 250

• S’ = 16.72 – 8.63 = 8.09 bits

• P(S’ >= 8.09) = 1 – e(-e-8.09) = 3.1* 10-4

Page 95: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Significance of Gapped Alignments

• Gapped alignments use same statistics

and K cannot be easily estimated

• Empirical estimations and gap scores determined by looking at random alignments

Page 96: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Bayesian Statistics

• Built upon conditional probabilities

• Used to derive joint probability of multiple events or conditions

Page 97: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Bayesian Statistics

• P(B|A): Probability of condition B given condition A is true

• P(B): Probability of condition B, regardless of condition A

• P(A, B): Joint probability of A and B occurring simultaneously

Page 98: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Bayesian Statistics

• A has two substates: A1, A2

• B has two substates: B1, B2

• P(B1) = 0.3 is known

• P(B2) = 1.0 – 0.3 = 0.7

• These are marginal probabilities

Page 99: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Joint Probabilities

• Bayes Theorem:

– P(A1,B1) = P(B1)P(A1|B1)– P(A1,B1) = P(A1)P(B1|A1)

Page 100: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Bayesian Example

• Given:– P(A1|B1) = 0.8; P(A2|B2) = 0.7

• Then– P(A2|B1) = 1.0 – 0.8 = 0.2; P(A1|B2) = 1.0 – 0.7 = 0.3

• AND– P(A1,B1) = P(B1)P(A1|B1) = 0.3 * 0.8 = 0.24– P(A2,B2) = P(B2)P(A2|B2) = 0.7 * 0.7 = 0.49– …

Page 101: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Posterior Probabilities

• Calculation of joint probabilities results in posterior probabilities

– Not known initially– Calculated using

• Prior probabilities• Initial information

Page 102: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Applications of Bayesian Statistics

• Evolutionary distance between two sequences (Mount, pp 122-124)

• Sequence Alignment (Mount, 124-134)

• Significance of Alignments (Durbin, 36-38)

• Gibbs Sampling (Covered later)

Page 103: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Pairwise Sequence Alignment Programs

• Blast 2 Sequences– NCBI– word based

sequence alignment

• LALIGN– FASTA package– Mult. Local

alignments

• needle– Global

Needleman/Wunsch alignment

• water– Local

Smith/Waterman alignment

Page 104: CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 2: Pairwise Sequence Alignment Eric C. Rouchka,

CECS 694-02 Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka

Various Sequence Alignments

Wise2 -- Genomic to protein

Sim4 -- Aligns expressed DNA to genomic sequence

spidey -- aligns mRNAs to genomic sequence

est2genome -- aligns ESTs to genomic sequence